Abstract

The goal of decentralized optimization over a network is to optimize a global objective
formed by a sum of local (possibly nonsmooth) convex functions using only local computation
and communication. It arises in various application domains, including distributed tracking
and localization, multi-agent coordination, estimation in sensor networks, and large-scale
optimization in machine learning. We develop and analyze distributed algorithms based on dual
averaging of subgradients, and we provide sharp bounds on their convergence rates as a function
of the network size and topology. Our method of analysis allows for a clear separation between
the convergence of the optimization algorithm itself and the effects of communication constraints
arising from the network structure. In particular, we show that the number of iterations
required by our algorithm scales inversely in the spectral gap of the network. The sharpness of this
prediction is confirmed both by theoretical lower bounds and simulations for various networks.
Our approach includes both the cases of deterministic optimization and communication, as well
as problems with stochastic optimization and/or communication.
1 Introduction

The focus of this paper is the development and analysis of distributed algorithms for solving
convex optimization problems that are defined over networks. Such network-structured optimization
problems arise in a variety of application domains within the information sciences and
engineering. For instance, problems such as multi-agent coordination, distributed tracking and
localization, estimation problems in sensor networks and packet routing are all naturally cast as
distributed convex minimization [BT89, LWHS02, LOT03, RN04, XBK07]. Common to these problems is
the necessity for completely decentralized computation that is locally light (so as to avoid
overburdening small sensors or flooding busy networks) and robust to periodic link or node
failures. As a second example, data sets that are too large to be processed quickly by any single
processor present related challenges. A canonical example that arises in statistical machine
learning is the problem of minimizing a loss function averaged over a large dataset (e.g.,
optimization in support vector machines [CV95]). With terabytes of data, it is desirable to assign
smaller subsets of the data to different processors, and the processors must communicate to find
parameters that minimize the loss over the entire dataset. However, the communication should be
efficient enough that network latencies do not offset computational gains.

Distributed computation has a long history in optimization. Primal and dual decomposition
methods that lend themselves naturally to a distributed paradigm have been known for at least
fifty years, and their behavior is well understood (e.g., [DW60, Ber99]). The 1980s saw
significant interest in distributed detection, consensus, and minimization. The seminal work of
Tsitsiklis and colleagues [Tsi84, TBA86, BT89] analyzed algorithms for minimization of a smooth
function $f$ known to several agents while distributing processing of components of the parameter
vector $x \in \mathbb{R}^n$. An important special case of network optimization, with much faster
convergence rates than those known for general distributed optimization, is consensus averaging,
where each processor in the network must agree on a single (vector-valued) variable. This is
recovered from our objective (1) by setting $f_i(x) = \frac{1}{2}\|x - \theta_i\|_2^2$. A number
of researchers have obtained sharp convergence results for distributed consensus algorithms by
studying network topology and using spectral properties of random walks or path averaging
arguments on the underlying graph structure (e.g., see [BGPS06, BDTV10, DSW08] and references
therein). Allowing stochastic gradients also lets us tackle distributed averaging with noise
[XBK07]. Mosk-Aoyama et al. [MARS10] consider a problem related to our setup, minimizing
$\sum_{i=1}^n f_i(x_i)$ for $x_i \in \mathbb{R}$ subject to linear equality constraints, and they
obtain rates of convergence dependent on network conductance using an algorithm similar to dual
decomposition. More recently, a few researchers have shifted focus to problems in which each
processor locally has its own convex (potentially non-differentiable) objective function
[NO09, LO09]. Whereas these initial papers treated the case of unconstrained optimization,
more recent work by Ram et al. [RNV10] analyzes a projected subgradient algorithm for distributed
minimization of non-smooth functions with constraints.

Our paper makes two main contributions. The first contribution is to provide a new simple 
subgradient algorithm for distributed constrained optimization of a convex function; we refer to it as 
a dual averaging subgradient method, since it is based on maintaining and forming weighted averages 
of subgradients throughout the network. This approach is essentially different from previously 
developed methods [NO09, LO09, RNV10], and these differences facilitate our analysis of network
scaling issues, meaning how convergence rates depend on network size and topology. Indeed, the 
second main contribution of this paper is a careful analysis that demonstrates a close link between 
convergence of the algorithm and the underlying spectral properties of the network. Our analysis 
splits the convergence rate of the algorithm into two terms: an optimization term and a network 
deviation term. We obtain the optimization penalty using techniques based on the optimization 
literature, specifically building on results due to Nesterov [Nes09]. This splitting approach can also
be adapted to naturally handle issues such as constrained optimization, stochastic communication, 
and stochastic optimization due to elegant properties of the dual averaging algorithm. On the 
other hand, the network scaling terms are obtained using techniques from analysis of Markov 
chains coupled with the distributed communication protocol. We show that the network deviation 
terms we derive are sharp for our algorithm; in the special case of the consensus problem, these 
terms are known to be near-optimal [BGPS06].

By comparison to previous work, our convergence results and proofs are different, and our
characterization of the network scaling terms is often much stronger. In particular, the
convergence rates given by the papers [NO09, LO09] grow exponentially in the number of nodes $n$ in
the network. Nedic et al. [NOOT09] and Ram et al. [RNV10] provide a much tighter analysis
that yields convergence rates that scale polynomially in the network size, but are independent
of the network topology (apart from requiring strong connectedness and degree independent of
$n$). Specifically, Corollary 5.5 in the paper [RNV10] guarantees that their projected subgradient
algorithm, under the assumptions that the number of time steps is known a priori and the stepsize
is chosen optimally, obtains an $\epsilon$-optimal solution to the optimization problem in
$\mathcal{O}(n^3/\epsilon^2)$ time. Since this bound is essentially independent of network
topology, it does not capture the intuition that distributed algorithms should converge much
faster on "well-connected" networks (expander graphs being a prime example) than on poorly
connected networks (e.g., chains, trees or single cycles). Johansson et al. [JRJ09] analyze a low
communication peer-to-peer protocol that attains rates dependent on network structure. However, in
their algorithm only one node has a current parameter value, while all nodes in our algorithm
maintain good estimates of the optimum at all times. This is important in online, streaming, and
control problems where nodes are expected to act or answer queries in real time. In further
comparison to previous work, our analysis gives network scaling terms that are often substantially
sharper. Our development yields an algorithm with convergence rate that scales inversely in the
spectral gap of the network. By exploiting known results on spectral gaps for graphs with $n$
nodes, we show that (disregarding logarithmic factors) our algorithm obtains an
$\epsilon$-optimal solution in $\mathcal{O}(n^2/\epsilon^2)$ iterations for a single cycle or
path, $\mathcal{O}(n/\epsilon^2)$ iterations for a two-dimensional grid, and
$\mathcal{O}(1/\epsilon^2)$ iterations for a bounded degree expander graph.
Moreover, simulation results show an excellent agreement with these theoretical predictions.

Our analysis covers several settings for distributed minimization. We begin by studying fixed 
communication protocols, which are of interest in a variety of areas such as cluster computing or 
sensor networks with a fixed hardware-dependent protocol. We also investigate randomized com- 
munication protocols as well as randomized network failures, which are often essential to handle 
gracefully in wireless sensor networks and large clusters with potential node failures. Randomized 
communication also provides an interesting tradeoff between communication savings and conver- 
gence rates. In this setting, we obtain much sharper results than previous work by studying the 
spectral properties of the expected transition matrix of a random walk on the underlying graph. We 
also present an analysis of our algorithm with stochastic gradient information, which is not difficult 
when combined with our initial theorems. We describe an extension for (structured) regularized 
objectives that often arise in machine learning problems in Appendix [Pl 

The remainder of this paper is organized as follows. Section 2 is devoted to a formal statement
of the problem and description of the dual averaging algorithm, whereas Section 3 states the main
results and consequences of our paper. Having stated our main results, in Section 4 we give a more
in-depth comparison of our work with other recent work. In Section 5, we state and prove basic
convergence results on the dual averaging algorithm, which we then exploit in Section 6 to derive
concrete results that depend on the spectral gap of the network. Sections 7 and 8 treat extensions
with noise, in particular algorithms with noisy communication and stochastic gradient methods
respectively. In Section 9, we present the results of simulations that confirm the sharpness of our
analysis.

Notation: We collect some notation used throughout the paper. We use $\mathbb{1}$ to denote the
all-ones vector. We also use standard asymptotic notation for sequences. If $a_n$ and $b_n$ are
positive sequences, then $a_n = \mathcal{O}(b_n)$ means that $\limsup_n a_n/b_n < \infty$, whereas
$a_n = \Omega(b_n)$ means that $\liminf_n a_n/b_n > 0$. On the other hand, $a_n = o(b_n)$ means
that $\lim_n a_n/b_n = 0$, and $a_n = \omega(b_n)$ means that $\lim_n a_n/b_n = \infty$. Finally,
we write $a_n = \Theta(b_n)$ if $a_n = \mathcal{O}(b_n)$ and $a_n = \Omega(b_n)$.



2 Problem set-up and algorithm 

In this section, we provide a formal statement of the distributed minimization problem, and a 
description of the distributed dual averaging algorithm. 



2.1 Distributed minimization 

We consider an optimization problem based on functions that are distributed over a network. More
specifically, let $G = (V, E)$ be an undirected graph over the vertex set $V = \{1, 2, \ldots, n\}$
with edge set $E \subset V \times V$. Associated with each $i \in V$ is a convex function
$f_i : \mathbb{R}^d \to \mathbb{R}$, and our overarching goal is to minimize the sum of these
functions subject to the constraint that $x \in \mathbb{R}^d$ belongs to some closed convex set
$\mathcal{X}$; that is, solve the problem

$$\min_{x} \; \frac{1}{n} \sum_{i=1}^n f_i(x) \quad \text{subject to} \quad x \in \mathcal{X}. \tag{1}$$

Each function $f_i$ is convex and hence sub-differentiable, but need not be smooth. We assume
without loss of generality that $0 \in \mathcal{X}$, since we can simply translate $\mathcal{X}$.
Each node $i \in V$ is associated with a separate agent, and each agent $i$ maintains its own
parameter vector $x_i \in \mathbb{R}^d$. The graph $G$ imposes communication constraints on the
agents: in particular, agent $i$ has local access to only the objective function $f_i$ and can
communicate directly only with its immediate neighbors
$j \in N(i) := \{j \in V \mid (i, j) \in E\}$.

Problems of this nature arise in a variety of application domains, and as motivation for the 
analysis to follow, let us consider a few here. A first example is a sensor network, in which each 
agent represents a sensor mote, equipped with a radio transmitter for communication, some basic 
sensing devices, and some local memory and computational power. In environmental applications 
of sensor networks, each mote $i$ might take a measurement $y_i$ of the temperature, and the global
objective could be to compute the median of the measurements $\{y_1, y_2, \ldots, y_n\}$. This median
computation problem can be formulated as minimizing the scalar objective function
$\frac{1}{n}\sum_{i=1}^n f_i(x)$, where $f_i(x) = |x - y_i|$. Similar formulations apply to the
problem of computing other statistics
such as means, variances, quantiles and other M-estimators. 
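
As a small numerical illustration of the median formulation, the following sketch checks that a median of the measurements minimizes the average absolute deviation. The readings in `y` and the helper name `avg_abs_loss` are made up for this example and do not come from the paper.

```python
from statistics import median

def avg_abs_loss(x, ys):
    """Global objective (1/n) * sum_i |x - y_i| for a scalar x."""
    return sum(abs(x - y) for y in ys) / len(ys)

# Hypothetical temperature readings, one per sensor mote.
y = [18.5, 21.0, 19.2, 35.0, 20.1]
m = median(y)

# The objective at the median is no larger than anywhere on a fine grid.
grid = [15 + 0.01 * k for k in range(2501)]
best_on_grid = min(avg_abs_loss(x, y) for x in grid)
assert avg_abs_loss(m, y) <= best_on_grid + 1e-12
```

Note that the outlier reading 35.0 barely affects the minimizer, which is one reason the absolute-value loss is attractive for sensor data.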

A second motivating example is the machine learning problem first described in Section 1.
In this case, the set $\mathcal{X}$ is the parameter space of the statistician or learner. Each
function $f_i$ is the empirical loss over the subset of data assigned to the $i$th processor, and
assuming that each subset is of equal size (or that the $f_i$ are normalized suitably), the
average $f$ is the empirical loss over the entire dataset. Here we use cluster computing as our
computational model, where each processor is a node in the cluster, and the graph $G$ contains
edges between those processors that are directly connected with small network latencies. A special
case of our optimization problem within this computational model is the distributed perceptron,
recently considered by McDonald et al. [MHM10].

2.2 Standard dual averaging 

Our algorithm is based on a projected dual averaging algorithm [Nes09], designed for minimization
of a (potentially nonsmooth) convex function $f$ subject to the constraint $x \in \mathcal{X}$. We begin by
describing the standard version of this algorithm, and then discuss the extensions for the distributed 
setting of interest in this paper. 



The dual averaging scheme is based on a proximal function $\psi : \mathbb{R}^d \to \mathbb{R}$ that
is assumed to be 1-strongly convex with respect to some norm $\|\cdot\|$; more precisely, the
proximal function satisfies

$$\psi(y) \ge \psi(x) + \langle \nabla \psi(x), y - x \rangle + \frac{1}{2}\|x - y\|^2 \quad \text{for all } x, y \in \mathcal{X}.$$

In addition, we assume that $\psi(x) \ge 0$ for all $x \in \mathcal{X}$ and that $\psi(0) = 0$;
these are standard assumptions that can be made without loss of generality. Examples of such
proximal function and norm pairs include:

• the quadratic $\psi(x) = \frac{1}{2}\|x\|_2^2$, which is the canonical proximal function. Clearly
$\frac{1}{2}\|0\|_2^2 = 0$, and $\frac{1}{2}\|x\|_2^2$ is strongly convex with respect to the
$\ell_2$-norm for $x \in \mathbb{R}^d$.

• the entropic function $\psi(x) = \sum_{i=1}^d x_i \log x_i - x_i$, which is strongly convex with
respect to the $\ell_1$-norm for $x$ in the probability simplex,
$\{x \mid x \succeq 0, \langle x, \mathbb{1} \rangle = 1\}$.

We assume that each function $f_i$ is $L$-Lipschitz with respect to the same norm $\|\cdot\|$; that is,

$$|f_i(x) - f_i(y)| \le L \|x - y\| \quad \text{for } x, y \in \mathcal{X}. \tag{2}$$

There are many cost functions $f_i$ that satisfy this type of Lipschitz condition. For instance, it
holds for any convex function on a compact domain $\mathcal{X}$, or for any polyhedral function on
an arbitrary domain [HUL96a]. Note that the Lipschitz condition (2) implies that for any
$x \in \mathcal{X}$ and any subgradient $g_i \in \partial f_i(x)$, we have

$$\|g_i\|_* \le L,$$

where $\|\cdot\|_*$ denotes the dual norm to $\|\cdot\|$, defined by
$\|v\|_* := \sup_{\|u\| = 1} \langle v, u \rangle$.

The dual averaging algorithm generates a sequence of iterates $\{x(t), z(t)\}_{t=0}^\infty$
contained within $\mathcal{X} \times \mathbb{R}^d$ according to the following steps. At time step
$t$ of the algorithm, it receives a subgradient $g(t) \in \partial f(x(t))$, and then performs the
updates

$$z(t+1) = z(t) + g(t) \quad \text{and} \quad x(t+1) = \Pi_{\mathcal{X}}^{\psi}(z(t+1), \alpha(t)), \tag{3}$$

where $\{\alpha(t)\}_{t=0}^\infty$ is a non-increasing sequence of positive stepsizes and

$$\Pi_{\mathcal{X}}^{\psi}(z, \alpha) := \arg\min_{x \in \mathcal{X}} \left\{ \langle z, x \rangle + \frac{1}{\alpha}\psi(x) \right\} \tag{4}$$

is a type of projection. The intuition underlying this algorithm is as follows: given the current
iterate $(x(t), z(t))$, the next iterate $x(t+1)$ is chosen to minimize an averaged first-order
approximation to the function $f$, while the proximal function $\psi$ and stepsize $\alpha(t) > 0$
enforce that the iterates $\{x(t)\}_{t=0}^\infty$ do not oscillate wildly. The algorithm is similar
to the follow the perturbed leader and lazy projection algorithms developed in the context of
online optimization [KV05], though in this form the algorithm seems to be originally due to
Nesterov [Nes09]. In Section 5, we show that a simple analysis of the convergence of the above
procedure allows us to relate it to the distributed algorithm we describe.
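
The updates (3) and (4) can be sketched in a few lines for one concrete case. The example below assumes the quadratic proximal function $\psi(x) = \frac{1}{2}\|x\|_2^2$ and takes $\mathcal{X}$ to be a box $[-r, r]^d$, for which the projection (4) reduces to clipping $-\alpha z$ coordinate-wise. The test objective $f(x) = \sum_k |x_k - c_k|$, the stepsize $\alpha(t) = 1/\sqrt{t+1}$, and all function names are illustrative choices, not part of the original text.

```python
import math

def project_quadratic(z, alpha, radius):
    """Pi(z, alpha) for psi(x) = 0.5*||x||_2^2 on the box [-radius, radius]^d.

    The objective <z, x> + ||x||^2 / (2*alpha) separates across coordinates,
    so the minimizer is -alpha*z clipped to the box.
    """
    return [max(-radius, min(radius, -alpha * zk)) for zk in z]

def subgradient(x, target):
    """A subgradient of f(x) = sum_k |x_k - target_k| (uses sign(0) = 0)."""
    return [(xk > ck) - (xk < ck) for xk, ck in zip(x, target)]

def dual_averaging(target, radius, steps):
    d = len(target)
    x = [0.0] * d            # x(0) = 0 minimizes psi
    z = [0.0] * d            # z(0) = 0
    running_sum = [0.0] * d
    for t in range(steps):
        g = subgradient(x, target)
        z = [zk + gk for zk, gk in zip(z, g)]      # z(t+1) = z(t) + g(t)
        alpha = 1.0 / math.sqrt(t + 1)             # non-increasing stepsize
        x = project_quadratic(z, alpha, radius)    # x(t+1) = Pi(z(t+1), alpha(t))
        running_sum = [s + xk for s, xk in zip(running_sum, x)]
    return [s / steps for s in running_sum]        # averaged iterate

x_avg = dual_averaging(target=[1.0, -2.0], radius=5.0, steps=5000)
```

The individual iterates keep oscillating around the minimizer with amplitude of order $\alpha(t)$, which is why the sketch returns the running average rather than the last iterate.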



2.3 Distributed dual averaging 

We now consider an appropriate and novel extension of dual averaging to the distributed setting.
At each iteration $t = 1, 2, 3, \ldots$, the algorithm maintains $n$ pairs of vectors
$(x_i(t), z_i(t)) \in \mathcal{X} \times \mathbb{R}^d$, with the $i$th pair associated with node
$i \in V$. At iteration $t$, each node $i \in V$ computes an element
$g_i(t) \in \partial f_i(x_i(t))$ in the subdifferential of the local function $f_i$ and receives
information about the parameters $\{z_j(t), j \in N(i)\}$ associated with nodes $j$ in its
neighborhood $N(i)$. Its update of the current estimated solution $x_i(t)$ is based on a convex
combination of these parameters. To model this weighting process, let
$P \in \mathbb{R}^{n \times n}$ be a matrix of non-negative weights that respects the structure of
the graph $G$, meaning that for $i \ne j$, $P_{ij} > 0$ only if $(i, j) \in E$. We assume that $P$
is a doubly stochastic matrix, so that

$$\sum_{j=1}^n P_{ij} = \sum_{j \in N(i)} P_{ij} = 1 \ \text{ for all } i \in V, \qquad \sum_{i=1}^n P_{ij} = \sum_{i \in N(j)} P_{ij} = 1 \ \text{ for all } j \in V.$$

Using this notation, given the non-increasing sequence $\{\alpha(t)\}_{t=0}^\infty$ of positive
stepsizes, each node $i \in V = \{1, 2, \ldots, n\}$ performs the updates

$$z_i(t+1) = \sum_{j \in N(i)} p_{ij} z_j(t) + g_i(t), \quad \text{and} \tag{5a}$$

$$x_i(t+1) = \Pi_{\mathcal{X}}^{\psi}(z_i(t+1), \alpha(t)), \tag{5b}$$

where the projection $\Pi_{\mathcal{X}}^{\psi}$ was defined previously in (4). In words, node $i$
computes the new dual parameter $z_i(t+1)$ from a weighted average of its own subgradient $g_i(t)$
and the parameters $\{z_j(t), j \in N(i)\}$ in its own neighborhood $N(i)$, and then computes the
next local iterate $x_i(t+1)$ by a projection defined by the proximal function $\psi$ and stepsize
$\alpha(t) > 0$.

In the sequel, we show convergence of the local sequence $\{x_i(t)\}_{t=1}^\infty$ to the optimum
of (1) via the running local average

$$\hat{x}_i(T) = \frac{1}{T} \sum_{t=1}^T x_i(t). \tag{6}$$

Note that this quantity is locally defined at node $i$ and so can be computed in a distributed
manner. From the definition of updates, it is clear that each element of the sequence
$\{z_i(t)\}_{t=0}^\infty$ is essentially a weighted average of the gradients seen so far, which is
a natural extension of dual averaging. At the same time, as we shall see, the averaging of the dual
parameters in the sequence $\{z_i(t)\}_{t=0}^\infty$ allows us to neatly sidestep the complexity
arising from non-linearity of projections. We will thus be able to generalize the algorithm from
equations (5a) and (5b) to the case where $P$ is random and varies with time as well as when the
vectors $g_i(t)$ are noisy versions of subgradients, satisfying only
$\mathbb{E}[g_i(t)] \in \partial f_i(x_i(t))$.
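
To make the updates (5a), (5b) and the running average (6) concrete, here is a minimal scalar sketch for the median objective $f_i(x) = |x - y_i|$ on a single cycle. It rests on several assumptions that are ours rather than the paper's: each node puts weight 1/3 on itself and on each of its two neighbors (so the sum in (5a) is read as including node $i$ itself), the proximal function is $\psi(x) = x^2/2$ on $\mathcal{X} = [-r, r]$, and the stepsize is $\alpha(t) = 1/\sqrt{t+1}$. The measurements and function names are hypothetical.

```python
import math

def cycle_weights(n):
    """Doubly stochastic P for a cycle: weight 1/3 on self and on each neighbor."""
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in (i - 1, i, i + 1):
            P[i][j % n] = 1.0 / 3.0
    return P

def distributed_dual_averaging(y, radius, steps):
    """Scalar sketch of updates (5a)-(5b) with f_i(x) = |x - y_i|.

    Uses psi(x) = x^2 / 2 on X = [-radius, radius], so the projection is a clip.
    Returns the running local averages (6), one per node.
    """
    n = len(y)
    P = cycle_weights(n)
    x = [0.0] * n
    z = [0.0] * n
    sums = [0.0] * n
    for t in range(steps):
        # Local subgradients g_i(t) of |x - y_i| at x_i(t).
        g = [(x[i] > y[i]) - (x[i] < y[i]) for i in range(n)]
        # (5a): mix the dual variables with the weights P, then add g_i(t).
        z = [sum(P[i][j] * z[j] for j in range(n)) + g[i] for i in range(n)]
        alpha = 1.0 / math.sqrt(t + 1)
        # (5b): projection for the quadratic proximal function.
        x = [max(-radius, min(radius, -alpha * zi)) for zi in z]
        sums = [s + xi for s, xi in zip(sums, x)]
    return [s / steps for s in sums]

y = [1.0, 2.0, 3.0, 4.0, 10.0]     # hypothetical local measurements
x_hat = distributed_dual_averaging(y, radius=20.0, steps=20000)
```

Every node only ever touches its own $y_i$ and the dual variables of its neighbors, yet each running average in `x_hat` should end up close to the global median of `y`.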

3 Main results and consequences 

We will now state the main results of this paper and illustrate some of their consequences. We give 
the proofs and a deeper investigation of related corollaries at length in the sections that follow. 
3.1 Convergence of distributed dual averaging

We start with a result on the convergence of the distributed dual averaging algorithm that provides a
decomposition of the error into an optimization term and the cost associated with network
communication. In order to state this theorem, we define the averaged dual variable
$\bar{z}(t) := \frac{1}{n} \sum_{i=1}^n z_i(t)$, and we recall the definition (6) of the local
average $\hat{x}_i(T)$.

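Theorem 1 (Basic convergence result). Let the sequences $\{x_i(t)\}_{t=0}^\infty$ and
$\{z_i(t)\}_{t=0}^\infty$ be generated by the updates (5a) and (5b) with step size sequence
$\{\alpha(t)\}_{t=0}^\infty$. Then for any $x^* \in \mathcal{X}$ and for each node $i \in V$, we have

$$f(\hat{x}_i(T)) - f(x^*) \le \frac{1}{T\alpha(T)}\psi(x^*) + \frac{L^2}{2T}\sum_{t=1}^T \alpha(t-1) + \frac{2L}{nT}\sum_{t=1}^T \sum_{j=1}^n \alpha(t)\,\|\bar{z}(t) - z_j(t)\|_* + \frac{L}{T}\sum_{t=1}^T \alpha(t)\,\|\bar{z}(t) - z_i(t)\|_*. \tag{7}$$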

Theorem 1 guarantees that after $T$ steps of the algorithm, every node $i \in V$ has access to a
locally defined quantity $\hat{x}_i(T)$ such that the difference $f(\hat{x}_i(T)) - f(x^*)$ is
upper bounded by a sum of four terms. The first two terms in the upper bound (7) are optimization
error terms that are common to subgradient algorithms. The third and fourth terms are penalties
incurred due to having different estimates at different nodes in the network, and they measure the
deviation of each node's estimate of the average gradient from the true average gradient.¹ Thus,
roughly, Theorem 1 ensures that as long as the bound on the deviation
$\|\bar{z}(t) - z_i(t)\|_*$ is tight enough, for appropriately chosen $\alpha(t)$ (say
$\alpha(t) \propto 1/\sqrt{t}$), the error of $\hat{x}_i(T)$ is small uniformly across all nodes
$i \in V$, and asymptotically approaches 0. See Theorem 2 in the next section for a precise
statement of rates.

It is worthwhile comparing the optimization error term from the bound (7) to known results.
Subgradient descent on the average function $f = \frac{1}{n}\sum_{i=1}^n f_i$ has an identical
convergence rate, as does the randomized version of incremental subgradient descent [NB01].
However, the distributed nature of the algorithm gives a computational advantage over full gradient
descent: the gradient computation requires $\mathcal{O}(1)$ computation per computer rather than
$\mathcal{O}(n)$ on a single computer. To
highlight the benefits compared to incremental subgradient descent, consider the common problem 
in machine learning and statistics of minimizing a loss on a large dataset. A randomized incre- 
mental gradient descent method must access random subsets of data at every iteration, leading to 
randomized disk seeks with high latency, which the distributed algorithm avoids. In addition, we 
expect (and empirically see that this is indeed the case) our method to produce more stable iterates, 
as we observe the gradients of all n functions at every round, albeit with a network communication 
lag. 

3.2 Convergence rates and network topology 

We now turn to investigation of the effects of network topology on convergence rates. In this
section, we assume that the network topology is static and that communication occurs via a fixed
doubly stochastic weight matrix $P$ at every round.² Since $P$ is doubly stochastic, it has largest
singular value $\sigma_1(P) = 1$. As summarized in the following result, the convergence rate of
the distributed projection algorithm is controlled by the spectral gap
$\gamma(P) := 1 - \sigma_2(P)$ of the matrix $P$.

¹ The fact that the term $\|\bar{z}(t) - z_i(t)\|_*$ appears an extra time is no significant
difficulty, as we will bound the difference $\bar{z}(t) - z_i(t)$ uniformly for all $i$ when giving
concrete convergence results.

² In later sections, we weaken these conditions.

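Theorem 2 (Rates based on spectral gap). Under the conditions and notation of Theorem 1,
suppose moreover that $\psi(x^*) \le R^2$. With step size choice
$\alpha(t) = \frac{R\sqrt{1 - \sigma_2(P)}}{4L\sqrt{t}}$, we have

$$f(\hat{x}_i(T)) - f(x^*) \le 8\,\frac{RL}{\sqrt{T}} \cdot \frac{\log(T\sqrt{n})}{\sqrt{1 - \sigma_2(P)}} \quad \text{for all } i \in V.$$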

To the best of our knowledge, this theorem is the first to establish a tight connection between
the convergence rate of distributed subgradient methods and the spectral properties of the
underlying network. In particular, the inverse dependence on the spectral gap $1 - \sigma_2(P)$ is
quite natural, since it is well-known to determine the rates of mixing in random walks on graphs
[LPW08], and the propagation of information in our algorithm is integrally tied to the random walk
on the underlying graph with transition probabilities specified by $P$.

Using Theorem 2, one can derive explicit convergence rates for several classes of interesting
networks, and Figure 1 illustrates four different graph topologies that are of interest. As a first
example, the $k$-connected cycle in panel (a) is formed by placing $n$ nodes on a circle and connecting
each node to its k neighbors on the right and left. For small k, the cycle graph is rather poorly 
connected, and our analysis will show that this leads to slower convergence rates than other graphs 
with better connectivity. The grid graph in two dimensions is obtained by connecting nodes to 
their k nearest neighbors in axis-aligned directions. For instance, panel (b) shows an example of a 
degree 4 grid graph in two-dimensions. Both the cycle and grid topologies are possible models for 
clustered computing as well as sensor networks. 

In panel (c), we show a random geometric graph, constructed by placing nodes uniformly at 
random in $[0, 1]^2$ and connecting any two nodes separated by a distance less than some radius $r > 0$.
These graphs are used to model the connectivity patterns of devices, such as wireless sensor motes, 
that can communicate with all nodes in some fixed radius ball, and have been studied extensively 
(e.g., [GK00, Pen03]). There are natural generalizations to dimensions $d > 2$ as well as to cases in
which the spatial positions are drawn from a non-uniform distribution. 

Finally, panel (d) shows an instance of a bounded degree expander, which belongs to a special 
class of sparse graphs that have very good mixing properties [Chu98]. Expanders are an attractive
option for the network topology in distributed computation since they are known to have
large spectral gaps. For many random graph models, a typical sample is an expander with high
probability; for instance, a randomly chosen bipartite graph satisfies this property [Alo86], as do
random degree regular graphs [FKS89]. In addition, there are several deterministic constructions
of expanders that are degree regular (see Section 6.3 of Chung [Chu98] for further details). The
deterministic constructions are of interest because they can be used to design a network, while the 
random constructions are of interest since they are often much simpler. 

Figure 1. Illustration of some graph classes of interest in distributed protocols. (a) A 3-connected
cycle. (b) Two-dimensional grid with 4-connectivity and non-toroidal boundary conditions. (c) A
random geometric graph. (d) A random 3-regular expander graph.

In order to state explicit convergence rates, we need to specify a particular choice of the matrix
$P$ that respects the graph structure. Although many such choices are possible, here we focus on
the graph Laplacian [Chu98]. First, we let $A \in \mathbb{R}^{n \times n}$ be the symmetric
adjacency matrix of the undirected graph $G$, satisfying $A_{ij} = 1$ when $(i, j) \in E$ and
$A_{ij} = 0$ otherwise. For each node $i \in V$, we let
$\delta_i = |N(i)| = \sum_{j=1}^n A_{ij}$ denote the degree of node $i$, and we define the diagonal
matrix $D = \mathrm{diag}\{\delta_1, \ldots, \delta_n\}$. We assume that the graph is connected, so
that $\delta_i \ge 1$ for all $i$, and hence $D$ is invertible. With this notation, the
(normalized) graph Laplacian is given by

$$\mathcal{L}(G) = I - D^{-1/2} A D^{-1/2}.$$

Note that the graph Laplacian $\mathcal{L} = \mathcal{L}(G)$ is always symmetric, positive
semidefinite, and satisfies $\mathcal{L} D^{1/2} \mathbb{1} = 0$. Therefore, when the graph is
degree-regular ($\delta_i = \delta$ for all $i \in V$), the standard random walk with self loops on
$G$ given by the matrix $P := I - \frac{\delta}{\delta + 1}\mathcal{L}$ is doubly stochastic and is
valid for our theory. For non-degree regular graphs, we need to make a minor modification in order
to obtain a doubly stochastic matrix.

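Letting $\delta_{\max} = \max_{i \in V} \delta_i$ denote the maximum degree, we define the modified matrix

$$P_n(G) := I - \frac{1}{\delta_{\max} + 1}(D - A) = I - \frac{1}{\delta_{\max} + 1} D^{1/2} \mathcal{L} D^{1/2}. \tag{8}$$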



This matrix is symmetric by construction, and moreover,
$\sum_{j=1}^n (D_{ij} - A_{ij}) = D_{ii} - \sum_{j=1}^n A_{ij} = 0$ for all $i \in V$, so it is
also doubly stochastic. Note that if the graph is $\delta$-regular, then $P_n(G)$ is the standard
choice above. Modulo a small technical detail about the ratios of $\delta_{\max}$ to $\delta_i$
and the eigenvalue order of $P$ (see Section 6.2), plugging $P_n(G)$ from (8) above into Theorem 2
immediately relates the convergence of distributed dual averaging to the spectral properties of the
graph Laplacian; in particular, we have:



$$f(\hat{x}_i(T)) - f(x^*) = \mathcal{O}\left( \frac{RL}{\sqrt{T}} \cdot \frac{\log(Tn)}{\sqrt{\lambda_{n-1}(\mathcal{L}(G))}} \right). \tag{9}$$
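
The construction in (8) is easy to check numerically. The following sketch builds $P_n(G)$ from a 0/1 adjacency matrix given as nested lists (assumed symmetric and without self loops) and can be used to confirm double stochasticity on a non-regular graph; the function name `lazy_walk_matrix` and the example path graph are our own choices for illustration.

```python
def lazy_walk_matrix(adj):
    """P_n(G) = I - (D - A) / (delta_max + 1) for a symmetric 0/1 adjacency matrix.

    Off-diagonal entries are A_ij / (delta_max + 1); the diagonal entry of row i
    is 1 - delta_i / (delta_max + 1), which absorbs the missing neighbor weight.
    """
    n = len(adj)
    deg = [sum(row) for row in adj]
    scale = 1.0 / (max(deg) + 1)
    P = [[scale * adj[i][j] for j in range(n)] for i in range(n)]
    for i in range(n):
        P[i][i] = 1.0 - scale * deg[i]
    return P

# Path graph on 4 nodes: end nodes have degree 1, interior nodes degree 2.
path = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
P = lazy_walk_matrix(path)
row_sums = [sum(row) for row in P]
col_sums = [sum(P[i][j] for i in range(4)) for j in range(4)]
```

In this example the end nodes keep weight 2/3 on themselves, which is exactly the correction that restores double stochasticity when degrees differ.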



The following result summarizes our conclusions for the choice of stochastic matrix in (8) via (9)
in application to different network topologies.

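Corollary 1. Under the conditions of Theorem 2, we have the following convergence rates:

(a) For $k$-connected paths and cycles,

$$f(\hat{x}_i(T)) - f(x^*) = \mathcal{O}\left( \frac{RL}{\sqrt{T}} \cdot \frac{n \log(Tn)}{k} \right).$$

(b) For $k$-connected $\sqrt{n} \times \sqrt{n}$ grids,

$$f(\hat{x}_i(T)) - f(x^*) = \mathcal{O}\left( \frac{RL}{\sqrt{T}} \cdot \frac{\sqrt{n} \log(Tn)}{k} \right).$$

(c) For random geometric graphs with connectivity radius
$r = \Omega\left(\sqrt{\log^{1+\epsilon} n / n}\right)$ for any $\epsilon > 0$,

$$f(\hat{x}_i(T)) - f(x^*) = \mathcal{O}\left( \frac{RL}{\sqrt{T}} \sqrt{\frac{n}{\log n}} \, \log(Tn) \right)$$

with high probability.

(d) For expanders with bounded ratio of minimum to maximum node degree,

$$f(\hat{x}_i(T)) - f(x^*) = \mathcal{O}\left( \frac{RL}{\sqrt{T}} \log(Tn) \right).$$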

Note that up to logarithmic factors, the optimization term in the convergence rate is always of
the order $RL/\sqrt{T}$, while the remaining terms vary depending on the network topology. Instead
of stating convergence rates, in order to understand scaling issues as a function of network size
and topology, it can be useful to re-state these results in terms of the number of iterations
$T_G(\epsilon; n)$ required to achieve error $\epsilon$ for network type $G$ with $n$ nodes. As
some special cases, Corollary 1 implies the following scalings:

• for the 1-connected single cycle graph, we have $T_{\text{cycle}}(\epsilon; n) = \mathcal{O}(n^2/\epsilon^2)$,

• for the two-dimensional grid, we have $T_{\text{grid}}(\epsilon; n) = \mathcal{O}(n/\epsilon^2)$, and

• for a bounded degree expander, we have $T_{\text{exp}}(\epsilon; n) = \mathcal{O}(1/\epsilon^2)$.
In general, Theorem 2 implies that at most

$$T_G(\epsilon; n) = \mathcal{O}\left( \frac{1}{\epsilon^2} \cdot \frac{1}{1 - \sigma_2(P_n(G))} \right) \tag{10}$$

iterations are required to achieve an $\epsilon$-accurate solution when using the matrix $P_n(G)$
previously defined in (8).

It is interesting to ask whether the upper bound (10) from our analysis is actually a sharp
result, meaning that it cannot be improved (up to constant factors). On one hand, it is known that
(even for centralized optimization algorithms), any subgradient method requires at least
$\Omega(1/\epsilon^2)$ iterations to achieve $\epsilon$-accuracy [NY83], so that the
$1/\epsilon^2$ term is unavoidable. The next proposition addresses the complementary issue, namely
whether the inverse spectral gap term is unavoidable for the dual averaging algorithm. For the
quadratic proximal function $\psi(x) = \frac{1}{2}\|x\|_2^2$, the following result establishes a
lower bound on the number of iterations in terms of graph topology and network structure:

Proposition 1. Consider the dual averaging algorithm (5a) and (5b) with quadratic proximal
function and communication matrix $P_n(G)$. For any graph $G$ with $n$ nodes, the number of
iterations $T_G(c; n)$ required to achieve a fixed accuracy $c > 0$ is lower bounded as

$$T_G(c; n) = \Omega\left( \frac{1}{1 - \sigma_2(P_n(G))} \right).$$
The proof of this result, given in Section 6.3, involves constructing a "hard" optimization problem
and lower bounding the number of iterations required for our algorithm to solve it. In conjunction
with Corollary 1, Proposition 1 implies that our predicted network scaling is sharp. Indeed, in
Section 9, we show that the theoretical scalings from Corollary 1 (namely, quadratic, linear, and
constant in network size $n$) are well-matched in simulations of our algorithm.
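
The quadratic scaling for the single cycle can be seen directly from the inverse spectral gap. The sketch below is restricted to one special case: the matrix $P = (I + A)/3$ on a single cycle, which is circulant, so its eigenvalues are available in closed form as $(1 + 2\cos(2\pi k/n))/3$. The helper names and the choice $\epsilon = 0.1$ are illustrative, and `iterations_needed` drops all constants and logarithmic factors.

```python
import math

def cycle_spectral_gap(n):
    """1 - sigma_2(P) for P = (I + A)/3 on a single cycle with n >= 3 nodes.

    P is symmetric and circulant, so its singular values are the absolute
    values of the eigenvalues (1 + 2*cos(2*pi*k/n)) / 3 for k = 0, ..., n-1.
    """
    eigs = [(1 + 2 * math.cos(2 * math.pi * k / n)) / 3 for k in range(1, n)]
    return 1 - max(abs(e) for e in eigs)

def iterations_needed(eps, gap):
    """Order-of-magnitude iteration count 1 / (eps^2 * gap), constants dropped."""
    return 1.0 / (eps ** 2 * gap)

for n in (10, 20, 40, 80):
    gap = cycle_spectral_gap(n)
    print(n, round(gap, 5), round(iterations_needed(0.1, gap)))
```

Doubling $n$ shrinks the gap by roughly a factor of four, so the iteration estimate grows quadratically in the cycle length, in line with the first bullet of the scalings listed after Corollary 1.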

3.3 Extensions to stochastic communication links 

Our results also extend to the case when the communication matrix $P$ is time-varying and random;
that is, the matrix $P(t)$ is potentially different for each $t$ and randomly chosen (but $P(t)$
still obeys the constraints imposed by $G$). Such stochastic communication is of interest for a variety of
reasons. If there is an underlying dense network topology, we might want to avoid communicating 
along every edge at each round to decrease communication and network congestion. For instance, 
the use of a gossip protocol |BGPS06] . in which one edge in the network is randomly chosen to 
communicate at each iteration, allows for a more refined trade-off between communication cost 
and number of iterations. Communication in real networks also incurs errors due to congestion or 
hardware failures, and we can model such errors by a stochastic process. 

The following theorem provides a convergence result for the case of time-varying random
communication matrices. In particular, it applies to sequences $\{x_i(t)\}_{t=0}^\infty$ and $\{z_i(t)\}_{t=0}^\infty$ generated by
the dual averaging algorithm with updates (5a) and (5b) with step size sequence $\{\alpha(t)\}_{t=0}^\infty$, but in
which $p_{ij}$ is replaced with $p_{ij}(t)$.

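Theorem 3 (Stochastic communication). Let $\{P(t)\}_{t=0}^\infty$ be an i.i.d. sequence of doubly stochastic
matrices, and define $\lambda_2(G) := \lambda_2(\mathbb{E}[P(t)^\top P(t)])$. For any $x^* \in \mathcal{X}$ and $i \in V$, with probability at
least $1 - 1/T$, we have

$$f(\hat{x}_i(T)) - f(x^*) \le \frac{1}{T\alpha(T)}\psi(x^*) + \frac{L^2}{2T}\sum_{t=1}^T \alpha(t-1) + \frac{3L^2}{T}\left(\frac{6\log(T^2 n)}{1 - \lambda_2(G)} + \frac{1}{T\sqrt{n}} + 2\right)\sum_{t=1}^T \alpha(t).$$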



We provide a proof of the theorem in Section 7. Note that the upper bound from the theorem
is valid for any sequence of non-increasing positive stepsizes $\{\alpha(t)\}_{t=0}^\infty$. The bound consists of three
terms, with the first growing and the last two shrinking as the stepsize choice is reduced. If we
assume that $\psi(x^*) \le R^2$, then we can optimize the tradeoff between these competing terms, and
we find that the stepsize sequence $\alpha(t) \propto \frac{R\sqrt{1 - \lambda_2}}{L\sqrt{t}}$ approximately minimizes the bound in
the theorem. This yields the scaling

$$f(\hat{x}_i(T)) - f(x^*) \le c\,\frac{RL}{\sqrt{T}} \cdot \frac{\log(Tn)}{\sqrt{1 - \lambda_2(\mathbb{E}[P(t)^\top P(t)])}} \tag{11}$$

for a universal constant $c$. We can also boost the probability with which this result holds to $1 - 1/T^k$
for any $k > 1$, without modifying the algorithm, at the cost of incurring a slightly higher constant
penalty in the error bound.

The setting of stochastic communication for distributed optimization was previously considered
by Lobel and Ozdaglar [LO09]. They established convergence by assuming lower bounds on the
entries of $P$ whenever two nodes communicated. As a consequence, their bounds grew exponentially
in the number of nodes $n$ in the network.¹ In contrast, the rates given here for stochastic
communication are directly comparable to the convergence rates in the previous section for fixed transition
matrices. More specifically, we have inverse dependence on the spectral gap of the expected
network, and consequently polynomial scaling for any network, as well as faster rates dependent on
network structure.

3.4 Results for stochastic gradient algorithms 

Finally, none of our convergence results rely on the gradients being correct. Specifically, we
can straightforwardly extend our results to the case of noisy gradients corrupted with zero-mean
bounded-variance noise. This setting is especially relevant in situations such as distributed
learning or wireless sensor networks, when the observed data is noisy. Let $\mathcal{F}_t$ be the $\sigma$-field containing all
information up to time $t$, that is, $\hat{g}_i(1), \ldots, \hat{g}_i(t) \in \mathcal{F}_t$ and $x_i(1), \ldots, x_i(t+1) \in \mathcal{F}_t$ for all $i$. We
define a stochastic oracle that provides gradient estimates satisfying

$$\mathbb{E}[\hat{g}_i(t) \mid \mathcal{F}_{t-1}] \in \partial f_i(x_i(t)) \quad\text{and}\quad \mathbb{E}[\|\hat{g}_i(t)\|_*^2 \mid \mathcal{F}_{t-1}] \le L^2. \tag{12}$$

As a special case, this model includes an additive noise oracle that takes an element of the
subgradient $\partial f_i(x_i(t))$ and adds to it bounded-variance zero-mean noise. Theorem 4 gives our result in
the case of stochastic gradients. We give the proof and further discussion in Section 8, noting that
because we adapt the dual averaging algorithm, the analysis follows quite cleanly from the earlier
analysis for the previous three theorems.

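Theorem 4 (Stochastic gradient updates). Let the sequence $\{x_i(t)\}_{t=1}^\infty$ be as in Theorem 1,
except that at each round of the algorithm agent $i$ receives a vector $\hat{g}_i(t)$ from an oracle satisfying
condition (12). For each $i \in V$, we have

$$\mathbb{E}[f(\hat{x}_i(T))] - f(x^*) \le \frac{1}{T\alpha(T)}\psi(x^*) + \frac{8L^2}{T}\sum_{t=1}^T \alpha(t-1) + \frac{3L^2}{T}\,\frac{\log(T\sqrt{n})}{1 - \sigma_2(P)}\sum_{t=1}^T \alpha(t).$$

If we assume in addition that $\mathcal{X}$ has finite radius $R := \sup_{x \in \mathcal{X}} \|x - x^*\|$ and that $\|\hat{g}_i(t)\|_* \le L$,
then with probability at least $1 - \delta$,

$$f(\hat{x}_i(T)) - f(x^*) \le \frac{1}{T\alpha(T)}\psi(x^*) + \frac{8L^2}{T}\sum_{t=1}^T \alpha(t-1) + \frac{3L^2}{T}\,\frac{\log(T\sqrt{n})}{1 - \sigma_2(P)}\sum_{t=1}^T \alpha(t) + 8LR\sqrt{\frac{\log(1/\delta)}{T}}.$$

If we further assume that the gradient estimates $\hat{g}_i(t)$ are uncorrelated given $\mathcal{F}_{t-1}$, then with
probability at least $1 - \delta$,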




As with the case of stochastic communication covered by Theorem 3, it should be clear that by
choosing the stepsize $\alpha(t) \propto \frac{R\sqrt{1 - \sigma_2(P)}}{L\sqrt{t}}$, we have essentially the same optimization error guarantee
as the bound (11), but with $\lambda_2(\mathbb{E}[P(t)^\top P(t)])$ replaced by $\sigma_2(P)$.



¹More precisely, inspection of the constant $C$ in equation (37) of their paper shows that it is of order $\gamma^{-2(n-1)}$,
where $\gamma$ is the lower bound on non-zero entries of $P$, so it is at least $4^{n-1}$.
4 Related Work

Having stated and discussed our main results in the previous section, we can now more explicitly
compare the results in this paper to those in previous work. Our aim here is to give a clear
understanding of how our algorithm and results relate to and, in many cases, improve upon prior
results. Specifically, with the results of Theorem 2 and Corollary 1 in hand, we can more directly
compare our results to other work.

As discussed in the introduction, other researchers have designed algorithms for solving the
problem (1). Most previous work [LO09, NO09, NOOT09, RNV10] studies convergence of a
(projected) gradient method in which each node $i$ in the network maintains $x_i(t) \in \mathcal{X}$, and at time $t$
performs the update

$$x_i(t+1) = \operatorname*{argmin}_{x \in \mathcal{X}} \Big\{ \frac{1}{2} \Big\| \sum_{j \in N(i)} p_{ji}\, x_j(t) - \alpha g_i(t) - x \Big\|_2^2 \Big\} \tag{13}$$

for $g_i(t) \in \partial f_i(x_i(t))$. With the update (13), Corollary 5.5 in the paper [RNV10] shows that



(we use our notation and assumptions from Theorem 2). The above bound is minimized by setting
the stepsize $\alpha$ proportional to $1/(n^{3/2}\sqrt{T})$, giving convergence rate $O(n^{3/2} RL/\sqrt{T})$. It is clear that this convergence
rate is substantially slower than all the rates in Corollary 1.

The distributed dual averaging algorithm (5a) and (5b) is quite different from the update (13).
The use of the proximal function $\psi$ allows us to address problems with non-Euclidean geometry,
which is useful, for example, for very high-dimensional problems or where the domain $\mathcal{X}$ is the
simplex (e.g., [NY83, Chapter 3]). The differences between the algorithms become more pronounced
in the analysis. Since we use dual averaging, we can avoid some technical difficulties introduced by
the projection step in the update (13). Precisely because of this technical issue, earlier works [NO09,
LO09] studied unconstrained optimization, and the averaging in $z_i(t)$ seems essential to the faster
rates our approach achieves as well as the ease with which we can extend our results to stochastic
settings.

In other related work, Johansson et al. [JRJ09] establish network-dependent rates for Markov
incremental gradient descent (MIGD), which maintains a single vector $x(t)$ at all times. A token
$i(t)$ determines an active node at time $t$, and at time step $t + 1$ the token moves to one of its
neighbors $j \in N(i(t))$, each with probability $P_{j\,i(t)}$. Letting $g_{i(t)}(t) \in \partial f_{i(t)}(x(t))$, the update is

$$x(t+1) = \operatorname*{argmin}_{x \in \mathcal{X}} \Big\{ \frac{1}{2} \big\| x(t) - \alpha g_{i(t)}(t) - x \big\|_2^2 \Big\}. \tag{14}$$

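Johansson et al. show that with optimal setting of $\alpha$ and symmetric transition matrix $P$, MIGD has
convergence rate $O\big(RL \max_i \sqrt{n\Gamma_{ii}/T}\big)$, where $\Gamma$ is the return time matrix $\Gamma = (I - P + \mathbf{1}\mathbf{1}^\top/n)^{-1}$.
In this case, let $\lambda_i(P) \in [-1, 1]$ denote the $i$th eigenvalue of $P$. The eigenvalues of $\Gamma$ are thus 1 and
$1/(1 - \lambda_i(P))$ for $i > 1$, and so we have

$$n \max_i \Gamma_{ii} \ge \operatorname{tr}(\Gamma) = 1 + \sum_{i=2}^n \frac{1}{1 - \lambda_i(P)} > \max\Big\{ \frac{1}{1 - \lambda_2(P)}, \frac{1}{1 - \lambda_n(P)} \Big\} \ge \frac{1}{1 - \sigma_2(P)}.$$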
Consequently, the bound in Theorem 2 is never weaker, and for certain graphs, our results are
substantially tighter, as shown in Corollary 1. For $d$-dimensional grids (where $d > 2$) we have
$T(\epsilon; n) = O(n^{2/d}/\epsilon^2)$, whereas MIGD scales as $T(\epsilon; n) = O(n/\epsilon^2)$. For well-connected graphs, such
as expanders and the complete graph, the MIGD algorithm scales as $T(\epsilon; n) = O(n/\epsilon^2)$, essentially
a factor of $n$ worse than our results.



5 Basic convergence analysis for distributed dual averaging 

In this section, we prove convergence of the distributed algorithm based on the updates (5a)
and (5b). We begin in Section 5.1 by defining some auxiliary quantities and establishing
lemmas useful in the proof, and we prove Theorem 1 in Section 5.2.



5.1 Setting up the analysis 

Using techniques related to those used in past work [NO09], we establish convergence via two
auxiliary sequences, given by

$$\bar{z}(t) := \frac{1}{n}\sum_{i=1}^n z_i(t) \quad\text{and}\quad y(t) := \Pi_{\mathcal{X}}^{\psi}(-\bar{z}(t), \alpha). \tag{15}$$


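We begin by showing that the average sum of gradients $\bar{z}(t)$ evolves in a very simple way. In
particular, we have

$$\bar{z}(t+1) = \frac{1}{n}\sum_{i=1}^n \sum_{j=1}^n p_{ij}\, z_j(t) + \frac{1}{n}\sum_{i=1}^n g_i(t).$$

Consider the right-hand side above, let $Z(t) = [z_1(t) \cdots z_n(t)]$ be the matrix of vectors $z_i$, and
denote $P = [p_1 \cdots p_n]$. Since the matrix $P$ is doubly stochastic, we have

$$\frac{1}{n}\sum_{i=1}^n \sum_{j=1}^n p_{ij}\, z_j(t) = \frac{1}{n} Z(t) P \mathbf{1} = \frac{1}{n} Z(t) \mathbf{1} = \bar{z}(t),$$

which yields the evolution

$$\bar{z}(t+1) = \bar{z}(t) + \frac{1}{n}\sum_{j=1}^n g_j(t). \tag{16}$$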

Consequently, the (negative of the) averaged dual sequence $\{\bar{z}(t)\}_{t=0}^\infty$ evolves almost like standard
subgradient descent on the function $f(x) = \sum_{i=1}^n f_i(x)/n$, the only difference being that $g_i(t)$ is a
subgradient at $x_i(t)$ (which need not be the same as the subgradient $g_j(t)$ at $x_j(t)$). The simple
evolution (16) of the averaged dual sequence allows us to avoid difficulties with the non-linearity
of projection that have been challenging in earlier work.
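
The averaging identity behind (16) is easy to check numerically. The following is a minimal sketch, not part of the original analysis: the 1/3-weight cycle matrix, the random stand-ins for subgradients, and the function names are all illustrative assumptions. It applies the update $z_i(t+1) = \sum_j p_{ij} z_j(t) + g_i(t)$ to every node and compares the resulting average with the recursion (16).

```python
import numpy as np


def cycle_mixing_matrix(n):
    """Symmetric doubly stochastic matrix for an n-node cycle (n >= 3).

    Each node puts weight 1/3 on itself and on each of its two neighbors.
    This particular matrix is an illustrative assumption.
    """
    P = np.zeros((n, n))
    for i in range(n):
        for j in (i - 1, i, i + 1):
            P[i, j % n] = 1.0 / 3.0
    return P


def max_average_drift(n=8, d=3, steps=25, seed=0):
    """Largest gap between the true average of the z_i and recursion (16)."""
    rng = np.random.default_rng(seed)
    P = cycle_mixing_matrix(n)
    Z = rng.normal(size=(n, d))          # row i holds z_i(t)
    z_bar = Z.mean(axis=0)               # averaged dual variable
    worst = 0.0
    for _ in range(steps):
        G = rng.normal(size=(n, d))      # arbitrary vectors standing in for g_i(t)
        Z = P @ Z + G                    # dual update applied to all nodes at once
        z_bar = z_bar + G.mean(axis=0)   # recursion (16)
        worst = max(worst, float(np.abs(Z.mean(axis=0) - z_bar).max()))
    return worst
```

The gap stays at the level of floating-point round-off because the columns of a doubly stochastic matrix sum to one, which is exactly the step used in the derivation above.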

Before proceeding with the proof of Theorem 1, we state two useful results regarding the
convergence of the standard dual averaging algorithm, though we defer their proofs to Appendix A.
We begin by giving a convergence guarantee for the single-objective form of the dual averaging
algorithm. Let $\{g(t)\}_{t=1}^\infty \subset \mathbb{R}^d$ be an arbitrary sequence of vectors, and consider the sequence
$\{x(t)\}_{t=1}^\infty$ defined by

$$x(t+1) := \operatorname*{argmin}_{x \in \mathcal{X}} \Big\{ \sum_{s=1}^t \langle g(s), x \rangle + \frac{1}{\alpha(t)}\psi(x) \Big\} = \Pi_{\mathcal{X}}^{\psi}\Big( \sum_{s=1}^t g(s), \alpha(t) \Big). \tag{17}$$

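Lemma 2. For any non-increasing sequence $\{\alpha(t)\}_{t=0}^\infty$ of positive stepsizes, and for any $x^* \in \mathcal{X}$,
we have

$$\sum_{t=1}^T \langle g(t), x(t) - x^* \rangle \le \frac{1}{2}\sum_{t=1}^T \alpha(t-1)\,\|g(t)\|_*^2 + \frac{1}{\alpha(T)}\psi(x^*).$$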

Next we state a lemma that allows us to restrict our analysis to the easier-to-analyze centralized
sequence $\{y(t)\}_{t=0}^\infty$ from (15):

Lemma 3. Consider the sequences $\{x_i(t)\}_{t=1}^\infty$, $\{z_i(t)\}_{t=0}^\infty$, and $\{y(t)\}_{t=0}^\infty$ defined according to
equations (5a), (5b), and (15). Recall that each $f_i$ is $L$-Lipschitz. For each $i \in V$, we have

$$\sum_{t=1}^T \big[ f(x_i(t)) - f(x^*) \big] \le \sum_{t=1}^T \big[ f(y(t)) - f(x^*) \big] + L\sum_{t=1}^T \alpha(t)\,\|\bar{z}(t) - z_i(t)\|_*.$$

Similarly, with the definitions $\hat{y}(T) := \frac{1}{T}\sum_{t=1}^T y(t)$ and $\hat{x}_i(T) := \frac{1}{T}\sum_{t=1}^T x_i(t)$, we have

$$f(\hat{x}_i(T)) - f(x^*) \le f(\hat{y}(T)) - f(x^*) + \frac{L}{T}\sum_{t=1}^T \alpha(t)\,\|\bar{z}(t) - z_i(t)\|_*.$$

Equipped with these tools, we now turn to the proof of Theorem 1.

5.2 Proof of Theorem 1

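Our proof is based on analyzing the sequence $\{y(t)\}_{t=0}^\infty$. Given an arbitrary $x^* \in \mathcal{X}$, we have

$$\begin{aligned}
\sum_{t=1}^T \big[ f(y(t)) - f(x^*) \big]
&= \sum_{t=1}^T \Big[ \frac{1}{n}\sum_{i=1}^n f_i(x_i(t)) - f(x^*) \Big] + \sum_{t=1}^T \frac{1}{n}\sum_{i=1}^n \big[ f_i(y(t)) - f_i(x_i(t)) \big] \\
&\le \sum_{t=1}^T \Big[ \frac{1}{n}\sum_{i=1}^n f_i(x_i(t)) - f(x^*) \Big] + \sum_{t=1}^T \sum_{i=1}^n \frac{L}{n}\,\|y(t) - x_i(t)\|,
\end{aligned} \tag{18}$$

where the inequality follows by the $L$-Lipschitz condition on $f_i$.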

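Let $g_i(t) \in \partial f_i(x_i(t))$ be a subgradient of $f_i$ at $x_i(t)$. Using convexity, we have the bound

$$\frac{1}{n}\sum_{t=1}^T \sum_{i=1}^n \big[ f_i(x_i(t)) - f_i(x^*) \big] \le \frac{1}{n}\sum_{t=1}^T \sum_{i=1}^n \langle g_i(t), x_i(t) - x^* \rangle. \tag{19}$$

Breaking up the right-hand side of (19) into two pieces, we obtain

$$\sum_{i=1}^n \langle g_i(t), x_i(t) - x^* \rangle = \sum_{i=1}^n \langle g_i(t), y(t) - x^* \rangle + \sum_{i=1}^n \langle g_i(t), x_i(t) - y(t) \rangle. \tag{20}$$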
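By definition of the updates for $\bar{z}(t)$ and $y(t)$, we have

$$y(t) = \operatorname*{argmin}_{x \in \mathcal{X}} \Big\{ \frac{1}{n}\sum_{s=1}^{t-1} \sum_{i=1}^n \langle g_i(s), x \rangle + \frac{1}{\alpha(t)}\psi(x) \Big\}.$$

Thus, we see that the first term in the decomposition (20) can be written in the same way as the
bound in Lemma 2, and as a consequence, we have the bound

$$\frac{1}{n}\sum_{t=1}^T \Big\langle \sum_{i=1}^n g_i(t),\, y(t) - x^* \Big\rangle
\le \frac{1}{2}\sum_{t=1}^T \alpha(t-1)\,\Big\| \frac{1}{n}\sum_{i=1}^n g_i(t) \Big\|_*^2 + \frac{1}{\alpha(T)}\psi(x^*)
\le \frac{L^2}{2}\sum_{t=1}^T \alpha(t-1) + \frac{1}{\alpha(T)}\psi(x^*). \tag{21}$$
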



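It remains to control the final two terms in the bounds (18) and (20). Since $\|g_i(t)\|_* \le L$ by
assumption, we have

$$\begin{aligned}
\sum_{t=1}^T \sum_{i=1}^n \frac{L}{n}\,\|y(t) - x_i(t)\| + \frac{1}{n}\sum_{t=1}^T \sum_{i=1}^n \langle g_i(t), x_i(t) - y(t) \rangle
&\le \frac{2L}{n}\sum_{t=1}^T \sum_{i=1}^n \|y(t) - x_i(t)\| \\
&= \frac{2L}{n}\sum_{t=1}^T \sum_{i=1}^n \big\| \Pi_{\mathcal{X}}^{\psi}(-\bar{z}(t), \alpha(t)) - \Pi_{\mathcal{X}}^{\psi}(-z_i(t), \alpha(t)) \big\|.
\end{aligned}$$

By the $\alpha$-Lipschitz continuity of the projection operator $\Pi_{\mathcal{X}}^{\psi}(\cdot, \alpha)$ (see Appendix A.3), we have

$$\frac{2L}{n}\sum_{t=1}^T \sum_{i=1}^n \big\| \Pi_{\mathcal{X}}^{\psi}(-\bar{z}(t), \alpha(t)) - \Pi_{\mathcal{X}}^{\psi}(-z_i(t), \alpha(t)) \big\| \le \frac{2L}{n}\sum_{t=1}^T \sum_{i=1}^n \alpha(t)\,\|\bar{z}(t) - z_i(t)\|_*.$$

Combining this bound with (18) and (21) yields the running sum bound

$$\sum_{t=1}^T \big[ f(y(t)) - f(x^*) \big] \le \frac{1}{\alpha(T)}\psi(x^*) + \frac{L^2}{2}\sum_{t=1}^T \alpha(t-1) + \frac{2L}{n}\sum_{t=1}^T \sum_{j=1}^n \alpha(t)\,\|\bar{z}(t) - z_j(t)\|_*. \tag{22}$$
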



Applying Lemma 3 to (22) gives that $\sum_{t=1}^T [f(x_i(t)) - f(x^*)]$ is upper bounded by

$$\frac{1}{\alpha(T)}\psi(x^*) + \frac{L^2}{2}\sum_{t=1}^T \alpha(t-1) + \frac{2L}{n}\sum_{t=1}^T \sum_{j=1}^n \alpha(t)\,\|\bar{z}(t) - z_j(t)\|_* + L\sum_{t=1}^T \alpha(t)\,\|\bar{z}(t) - z_i(t)\|_*.$$

Dividing both sides by $T$ and using convexity of $f$ yields the bound (7).
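
To make the objects in this proof concrete, here is a minimal simulation sketch of the distributed dual averaging updates on a toy problem. Everything specific in it is an assumption chosen for illustration: scalar objectives $f_i(x) = |x - b_i|$ on $\mathcal{X} = [-1, 1]$, the proximal function $\psi(x) = x^2/2$ (so that the projection step is a clipped rescaling of $-z_i$), a 1/3-weight cycle as the communication matrix, and the stepsize $\alpha(t) = 1/\sqrt{t}$. It is not the authors' experimental setup.

```python
import numpy as np


def cycle_mixing_matrix(n):
    """Symmetric doubly stochastic matrix for an n-node cycle (n >= 3)."""
    P = np.zeros((n, n))
    for i in range(n):
        for j in (i - 1, i, i + 1):
            P[i, j % n] = 1.0 / 3.0
    return P


def dual_averaging_on_cycle(b, T=4000, step=1.0):
    """Run the sketch and return the running averages of the local iterates."""
    b = np.asarray(b, dtype=float)
    n = b.size
    P = cycle_mixing_matrix(n)
    z = np.zeros(n)                          # dual variables z_i(t)
    x = np.zeros(n)                          # primal iterates, started at argmin of psi
    x_sum = np.zeros(n)
    for t in range(1, T + 1):
        x_sum += x                           # accumulate x_i(t) for the average
        g = np.sign(x - b)                   # a subgradient of |x - b_i| at x_i(t)
        z = P @ z + g                        # dual update: mix with neighbors, add gradient
        alpha = step / np.sqrt(t)            # non-increasing stepsize (assumed form)
        # primal update: argmin over [-1, 1] of z*x + x^2 / (2*alpha)
        x = np.clip(-alpha * z, -1.0, 1.0)
    return x_sum / T


def objective(x, b):
    """Global objective f(x) = (1/n) * sum_i |x - b_i|."""
    return float(np.mean(np.abs(x - np.asarray(b, dtype=float))))
```

A typical use is to call `dual_averaging_on_cycle(b)` for a small vector `b` and compare `objective(x_hat[i], b)` at each node with the optimal value `objective(np.median(b), b)`; the gap should shrink as `T` grows, in line with the bound (7).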
6 Convergence rates, spectral gap, and network topology

In this section, we give concrete convergence rates for the distributed dual averaging algorithm
based on the mixing time of a random walk according to the doubly stochastic matrix $P$. The
understanding of the dependence of our convergence rates in terms of the underlying network
topology is crucial, because it can provide important cues to the system administrator in a clustered
computing environment or for the locations and connectivities of sensors in a sensor network. We
begin in Section 6.1 with the proof of Theorem 2. In Section 6.2, we prove the graph-specific
convergence rates stated in Corollary 1, whereas Section 6.3 contains a proof of the lower bound
stated in Proposition 1.

Throughout this section, we adopt the following notational conventions. For an $n \times n$ matrix
$B$, we call its singular values $\sigma_1(B) \ge \sigma_2(B) \ge \cdots \ge \sigma_n(B) \ge 0$. For a real symmetric $B$, we use
$\lambda_1(B) \ge \lambda_2(B) \ge \cdots \ge \lambda_n(B)$ to denote the $n$ real eigenvalues of $B$. We let $\Delta_n = \{x \in \mathbb{R}^n \mid
x \succeq 0, \sum_{i=1}^n x_i = 1\}$ denote the $n$-dimensional probability simplex. We make frequent use of the
following standard inequality: for any positive integer $t = 1, 2, \ldots$ and any $x \in \Delta_n$,

$$\|P^t x - \mathbf{1}/n\|_{\mathrm{TV}} = \frac{1}{2}\|P^t x - \mathbf{1}/n\|_1 \le \frac{1}{2}\sqrt{n}\,\|P^t x - \mathbf{1}/n\|_2 \le \frac{1}{2}\sigma_2(P)^t \sqrt{n}. \tag{23}$$

For a brief review of the relevant standard Perron-Frobenius and matrix theory, we refer the reader
to Appendix B.
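
The chain of inequalities in (23) can be checked numerically for a given matrix. The sketch below is an illustration only: the cycle matrix and the helper names are assumptions, and the check covers one matrix and a finite horizon rather than proving anything.

```python
import numpy as np


def cycle_mixing_matrix(n):
    """Symmetric doubly stochastic matrix for an n-node cycle (n >= 3)."""
    P = np.zeros((n, n))
    for i in range(n):
        for j in (i - 1, i, i + 1):
            P[i, j % n] = 1.0 / 3.0
    return P


def mixing_bound_holds(P, x, t_max=50, tol=1e-12):
    """Check both inequalities of (23) for t = 1, ..., t_max."""
    n = P.shape[0]
    sigma2 = np.linalg.svd(P, compute_uv=False)[1]  # second largest singular value
    uniform = np.full(n, 1.0 / n)
    v = np.asarray(x, dtype=float)
    for t in range(1, t_max + 1):
        v = P @ v                                   # v = P^t x
        tv = 0.5 * np.abs(v - uniform).sum()        # total variation distance
        l2 = np.linalg.norm(v - uniform)
        if tv > 0.5 * np.sqrt(n) * l2 + tol:        # l1 versus l2 norm
            return False
        if 0.5 * np.sqrt(n) * l2 > 0.5 * sigma2 ** t * np.sqrt(n) + tol:
            return False                            # geometric decay at rate sigma_2
    return True
```

Starting from a point mass, `mixing_bound_holds(cycle_mixing_matrix(10), np.eye(10)[0])` exercises the case used later in the proof, where $x$ is a standard basis vector $e_i$.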



6.1 Proof of Theorem 2

We focus on controlling the network error term in the bound (7), namely the quantity

$$\frac{L}{n}\sum_{t=1}^T \sum_{i=1}^n \alpha(t)\,\|\bar{z}(t) - z_i(t)\|_*.$$




Define the matrix $\Phi(t, s) = P^{t-s+1}$ (in the sequel we allow the stochastic matrix $P$ to change as
a function of time). Let $[\Phi(t, s)]_{ji}$ be the $j$th entry of the $i$th column of $\Phi(t, s)$. Then via a bit of
algebra, we can write

$$z_i(t+1) = \sum_{j=1}^n [\Phi(t, s)]_{ji}\, z_j(s) + \sum_{r=s+1}^t \Big( \sum_{j=1}^n [\Phi(t, r)]_{ji}\, g_j(r-1) \Big) + g_i(t). \tag{24}$$

Clearly the above reduces to the standard update (5a) when $s = t$. Since $\bar{z}(t)$ evolves simply
as in (16), we assume that $z_i(0) = \bar{z}(0)$ to avoid notational clutter (we can simply start with
$z_i(0) = 0$) and use (24) to see

$$\bar{z}(t) - z_i(t) = \sum_{s=1}^{t-1} \sum_{j=1}^n \big( 1/n - [\Phi(t-1, s)]_{ji} \big)\, g_j(s-1) + \Big( \frac{1}{n}\sum_{j=1}^n \big( g_j(t-1) - g_i(t-1) \big) \Big). \tag{25}$$

We use the fact that $\|g_i(t)\|_* \le L$ for all $i$ and $t$ and (25) to see that

$$\begin{aligned}
\|\bar{z}(t) - z_i(t)\|_*
&= \Big\| \sum_{s=1}^{t-1} \sum_{j=1}^n \big( 1/n - [\Phi(t-1, s)]_{ji} \big)\, g_j(s-1) + \Big( \frac{1}{n}\sum_{j=1}^n g_j(t-1) - g_i(t-1) \Big) \Big\|_* \\
&\le \sum_{s=1}^{t-1} \sum_{j=1}^n \|g_j(s-1)\|_* \,\big| (1/n) - [\Phi(t-1, s)]_{ji} \big| + \frac{1}{n}\sum_{j=1}^n \|g_j(t-1) - g_i(t-1)\|_* \\
&\le L\sum_{s=1}^{t-1} \big\| [\Phi(t-1, s)]_i - \mathbf{1}/n \big\|_1 + 2L.
\end{aligned} \tag{26}$$




Now we break the sum in (26) into two terms separated by a cutoff point $\hat{t}$. The first term
consists of "throwaway" terms, that is, timesteps $s$ for which the Markov chain with transition
matrix $P$ has not mixed, while the second consists of steps $s$ for which $\|[\Phi(t-1, s)]_i - \mathbf{1}/n\|_1$ is
small. Note that the indexing on $\Phi(t-1, s) = P^{t-s}$ implies that for small $s$, $\Phi(t-1, s)$ is close
to uniform. From (23), $\|[\Phi(t, s)]_j - \mathbf{1}/n\|_1 \le \sqrt{n}\,\sigma_2(P)^{t-s+1}$. Hence, if

$$t - s \ge \frac{\log \epsilon^{-1}}{\log \sigma_2(P)^{-1}} - 1, \quad\text{we immediately have}\quad \|[\Phi(t, s)]_j - \mathbf{1}/n\|_1 \le \sqrt{n}\,\epsilon.$$

Thus, by setting $\epsilon^{-1} = T\sqrt{n}$, for $t - s + 1 \ge \frac{\log(T\sqrt{n})}{\log \sigma_2(P)^{-1}}$, we have

$$\|[\Phi(t, s)]_j - \mathbf{1}/n\|_1 \le \frac{1}{T}. \tag{27}$$

For larger $s$, we simply have $\|[\Phi(t, s)]_j - \mathbf{1}/n\|_1 \le 2$. The above suggests that we split the sum at
$\hat{t} = \frac{\log(T\sqrt{n})}{\log \sigma_2(P)^{-1}}$. We break apart the sum in (26) and use (27) to see that since $t - 1 - (t - \hat{t}) = \hat{t} - 1$
and there are at most $T$ steps in the summation,

$$\begin{aligned}
\|\bar{z}(t) - z_i(t)\|_*
&\le L\sum_{s=t-\hat{t}}^{t-1} \|\Phi(t-1, s)e_i - \mathbf{1}/n\|_1 + L\sum_{s=1}^{t-1-\hat{t}} \|\Phi(t-1, s)e_i - \mathbf{1}/n\|_1 + 2L \\
&\le 2L\,\frac{\log(T\sqrt{n})}{\log \sigma_2(P)^{-1}} + 3L \le 2L\,\frac{\log(T\sqrt{n})}{1 - \sigma_2(P)} + 3L.
\end{aligned} \tag{28}$$

The last inequality follows from the concavity of $\log(\cdot)$, since $\log \sigma_2(P)^{-1} \ge 1 - \sigma_2(P)$.

Combining (28) with the running sum bound in (22) of the proof of the basic theorem,
Theorem 1, we immediately see that for $x^* \in \mathcal{X}$,

$$\sum_{t=1}^T \big[ f(y(t)) - f(x^*) \big] \le \frac{1}{\alpha(T)}\psi(x^*) + \frac{L^2}{2}\sum_{t=1}^T \alpha(t-1) + 6L^2\sum_{t=1}^T \alpha(t) + 4L^2\,\frac{\log(T\sqrt{n})}{1 - \sigma_2(P)}\sum_{t=1}^T \alpha(t). \tag{29}$$

Appealing to Lemma 3 allows us to obtain the same result on the sequence $x_i(t)$ with slightly worse
constants. Note that $\sum_{t=1}^T t^{-1/2} \le 2\sqrt{T} - 1$. Thus, using the assumption that $\psi(x^*) \le R^2$, using
convexity to bound $f(\hat{y}(T)) \le \frac{1}{T}\sum_{t=1}^T f(y(t))$ (and similarly for $\hat{x}_i(T)$), and setting $\alpha(t)$ as in the
statement of the theorem completes the proof.
6.2 Proof of Corollary 1

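The corollary is based on bounding the spectral gap of the matrix $P_n(G)$ from equation (8).

Lemma 4. The matrix $P_n(G)$ satisfies the bound

$$\sigma_2(P_n(G)) \le \max\Big\{ 1 - \frac{\min_i \delta_i}{\delta_{\max} + 1}\,\lambda_{n-1}(\mathcal{L}),\; \frac{\delta_{\max}}{\delta_{\max} + 1}\,\lambda_1(\mathcal{L}) - 1 \Big\}.$$

Proof. By a theorem of Ostrowski on congruent matrices (cf. Theorem 4.5.9, [HJ85]), we have

$$\lambda_k(D^{1/2}\mathcal{L}D^{1/2}) \in \Big[ \min_i \delta_i\,\lambda_k(\mathcal{L}),\; \max_i \delta_i\,\lambda_k(\mathcal{L}) \Big]. \tag{30}$$

Since $\mathcal{L}D^{1/2}\mathbf{1} = 0$, we have $\lambda_n(\mathcal{L}) = 0$, and so it suffices to focus on $\lambda_1(D^{1/2}\mathcal{L}D^{1/2})$ and $\lambda_{n-1}(D^{1/2}\mathcal{L}D^{1/2})$.
From the definition (8), the eigenvalues of $P$ are of the form $1 - (\delta_{\max} + 1)^{-1}\lambda_k(D^{1/2}\mathcal{L}D^{1/2})$.
The bound (30) coupled with the fact that all the eigenvalues of $\mathcal{L}$ are non-negative implies that
$\sigma_2(P) = \max_{k<n} \{ |1 - (\delta_{\max} + 1)^{-1}\lambda_k(D^{1/2}\mathcal{L}D^{1/2})| \}$ is upper bounded by the larger of

$$1 - \frac{\min_i \delta_i}{\delta_{\max} + 1}\,\lambda_{n-1}(\mathcal{L}) \quad\text{and}\quad \frac{\delta_{\max}}{\delta_{\max} + 1}\,\lambda_1(\mathcal{L}) - 1. \qquad \square$$

Much of spectral graph theory is devoted to bounding $\lambda_{n-1}(\mathcal{L})$ sufficiently far away from zero, and
Lemma 4 allows us to conveniently leverage such results for bounding the convergence rate of our
algorithm.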

Note that computing the upper bound in Lemma 4 requires controlling both $\lambda_{n-1}(\mathcal{L})$ and
$\lambda_1(\mathcal{L})$. In order to circumvent this complication, we use the well-known idea of a "lazy" random
walk [Chu98, LPW08], in which we replace $P$ by $\frac{1}{2}(I + P)$. The resulting symmetric matrix has
the same eigenstructure as $P$, and moreover, we have

$$\sigma_2\big(\tfrac{1}{2}(I + P)\big) = \lambda_2\big(\tfrac{1}{2}(I + P)\big) = \lambda_2\Big( I - \frac{1}{2(\delta_{\max} + 1)} D^{1/2}\mathcal{L}D^{1/2} \Big) \le 1 - \frac{\min_i \delta_i}{2(\delta_{\max} + 1)}\,\lambda_{n-1}(\mathcal{L}). \tag{31}$$

Consequently, it is sufficient to bound only $\lambda_{n-1}(\mathcal{L})$, which is more convenient from a technical
standpoint. The convergence rate implied by the lazy random walk through Theorem 2 is no worse
than twice that of the original walk, which is insignificant for the analysis in this paper.
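
The effect of the lazy walk can be seen on a small example. The sketch below builds $P = I - (D - A)/(\delta_{\max} + 1)$, which matches the eigenvalue description used in the proof of Lemma 4, for a complete bipartite graph; the choice of graph and the helper names are illustrative assumptions. On this graph the most negative eigenvalue of $P$ dominates $\sigma_2(P)$, while for $\frac{1}{2}(I + P)$ the second singular value coincides with the second eigenvalue.

```python
import numpy as np


def comm_matrix(A):
    """P = I - (D - A) / (delta_max + 1) for a 0/1 adjacency matrix A."""
    deg = A.sum(axis=1)
    return np.eye(A.shape[0]) - (np.diag(deg) - A) / (deg.max() + 1.0)


def sigma2(P):
    """Second largest singular value."""
    return float(np.linalg.svd(P, compute_uv=False)[1])


def lambda2(P):
    """Second largest eigenvalue of a symmetric matrix."""
    return float(np.sort(np.linalg.eigvalsh(P))[-2])


def complete_bipartite(m):
    """Adjacency matrix of the complete bipartite graph K_{m,m}."""
    A = np.zeros((2 * m, 2 * m))
    A[:m, m:] = 1.0
    A[m:, :m] = 1.0
    return A


P = comm_matrix(complete_bipartite(5))
P_lazy = 0.5 * (np.eye(10) + P)   # lazy random walk
```

For $K_{5,5}$ the eigenvalues of $P$ are $1$, $1/6$ (with multiplicity 8), and $-2/3$, so `sigma2(P)` is $2/3$ while `lambda2(P)` is $1/6$; after the lazy transformation both `sigma2(P_lazy)` and `lambda2(P_lazy)` equal $7/12$.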

We are now equipped to address each of the graph classes covered by Corollary 1.

Cycles and paths: Recall the regular $k$-connected cycle from Figure 1(a), constructed by placing
the $n$ nodes on a circle and connecting every node to $k$ neighbors on the right and left. For this
graph, the Laplacian $\mathcal{L}$ is a circulant matrix with diagonal entries 1 and off-diagonal non-zero
entries $-1/(2k)$. Known results on circulant matrices (see Chapter 3 of Gray [Gra06]) imply that it
has $m$th eigenvalue

$$\lambda_m(\mathcal{L}) = 1 - \frac{1}{2k}\sum_{j=1}^k \exp(-2\pi i\, jm/n) - \frac{1}{2k}\sum_{j=1}^k \exp(-2\pi i\,(n - j)m/n) = 1 - \frac{1}{k}\sum_{j=1}^k \cos\Big( \frac{2\pi jm}{n} \Big).$$
For $m = n - 1$ and $k = o(n)$, the last equation can be massaged into [BGPS06, Section VI.A]

A„_i(£) = l-cos +0^ 



n I \n^ 



By performing a Taylor expansion of $\cos(\cdot)$, we see that $\lambda_{n-1}(\mathcal{L}) = \Theta(k^2/n^2)$ for $k = o(n)$.

Now consider the regular $k$-connected path, a path in which each node is connected to the $k$
neighbors on its right and left. By computing Cheeger constants (see Lemma 6 in Appendix C), we
see that if $k \le \sqrt{n}$, then $\lambda_{n-1}(\mathcal{L}) = \Theta(k^2/n^2)$. Note also that for the $k$-connected path on $n$ nodes,
$\min_i \delta_i = k$ and $\delta_{\max} = 2k$. Thus, we can combine the previous two paragraphs with Lemma 4 to
see that for regular $k$-connected paths or cycles with $k \le \sqrt{n}$,

$$\sigma_2(P) = 1 - \Theta\Big( \frac{k^2}{n^2} \Big). \tag{32}$$

Substituting the bound (32) into Theorem 2 yields the claim of Corollary 1(a).

Regular grids: Now consider the case of a $\sqrt{n}$-by-$\sqrt{n}$ grid, focusing specifically on regular
$k$-connected grids, in which any node is joined to every node that is fewer than $k$ horizontal or
vertical edges away in an axis-aligned direction. In this case, we use results on Cartesian products
of graphs [Chu98, Section 2.6] to analyze the eigenstructure of the Laplacian. In particular, the
toroidal $\sqrt{n}$-by-$\sqrt{n}$ $k$-connected grid is simply the Cartesian product of two regular $k$-connected
cycles of $\sqrt{n}$ nodes. The second smallest eigenvalue of a Cartesian product of graphs is half the
minimum of second-smallest eigenvalues of the original graphs [Chu98, Theorem 2.13]. Thus, based
on the preceding discussion of $k$-connected cycles, we conclude that if $k = o(\sqrt{n})$, then we have
$\lambda_{n-1}(\mathcal{L}) = \Theta(k^2/n)$. For a non-toroidal $\sqrt{n}$-by-$\sqrt{n}$ grid (in which the network is not "wrapped"
on its boundaries, as in Figure 1(b)), we use the previous discussion of regular $k$-connected paths,
since the grid is the Cartesian product of two $k$-connected paths of $\sqrt{n}$ nodes. We immediately see
that $\lambda_{n-1}(\mathcal{L}) = \Theta(k^2/n)$. In both cases, for $\sqrt{n}$-by-$\sqrt{n}$ $k$-connected grids, we use Lemma 4 and
(31) to see that for $k \le n^{1/4}$,

$$\sigma_2(P) = 1 - \Theta\Big( \frac{k^2}{n} \Big). \tag{33}$$

The result in Corollary 1(b) immediately follows.

Random geometric graphs: Using the proof of Lemma 10 from Boyd et al. [BGPS06], we see
that for any $\epsilon > 0$, if $r = \sqrt{\log^{1+\epsilon} n/(n\pi)}$, then with probability at least $1 - 2/n^{c-1}$,

$$\min_i \delta_i \ge \log^{1+\epsilon} n - \sqrt{2c}\,\log n \quad\text{and}\quad \max_i \delta_i \le \log^{1+\epsilon} n + \sqrt{2c}\,\log n. \tag{34}$$

Thus, letting $\mathcal{L}$ be the graph Laplacian of a random geometric graph, if we can bound $\lambda_{n-1}(\mathcal{L})$,
(34) coupled with Lemma 4 will control the convergence rate of our algorithm.

Recent work of von Luxburg et al. [vLRH10] gives concentration results on the second-smallest
eigenvalue of a geometric graph. In particular, their Theorem 3 says that there are universal
constants $c_1, \ldots, c_5 > 0$ such that with probability at least $1 - c_1 n \exp(-c_2 n r^2) - c_3 \exp(-c_4 n r^2)/r^2$,
$\lambda_{n-1}(\mathcal{L}) \ge c_5 r^2$. Parsing this a bit, we see that if $r = \omega(\sqrt{\log n/n})$, then with exceedingly high
probability, $\lambda_{n-1}(\mathcal{L}) = \Omega(r^2) = \omega(\log n/n)$. Using (34), we see that for $r = (\log^{1+\epsilon} n/n)^{1/2}$,

$$\frac{\min_i \delta_i}{\max_i \delta_i} = \Theta(1) \quad\text{and}\quad \lambda_{n-1}(\mathcal{L}) = \Omega\Big( \frac{\log^{1+\epsilon} n}{n} \Big)$$

with high probability. Combining the above equation with Lemma 4 and (31), we have

$$\sigma_2(P) = 1 - \Omega\Big( \frac{\log^{1+\epsilon} n}{n} \Big). \tag{35}$$

Thus we have obtained the result of Corollary 1(c). Our bounds show that a grid and a random
geometric graph exhibit the same convergence rate up to logarithmic factors.

Expanders: The constant spectral gap in expanders [Chu98, Chapter 6] removes any penalty
due to network communication (up to logarithmic factors), and hence yields Corollary 1(d).

6.3 Proof of Proposition 1

We now give a proof of Proposition 1, which shows that the dependence of our convergence rates
on the spectral gap is tight. The proof is based on construction of a set of objective functions $f_i$
that force convergence to be slow by using the second eigenvector of the communication matrix $P$.
Recall that $\mathbf{1} \in \mathbb{R}^n$ is the eigenvector of $P$ corresponding to its largest eigenvalue (equal to
1). Let $v \in \mathbb{R}^n$ be the eigenvector of $P$ corresponding to its second singular value, $\sigma_2(P)$. By
using the lazy random walk defined in Section 6.2, we may assume without loss of generality that
$\lambda_2(P) = \sigma_2(P)$. Let $w = v/\|v\|_\infty$ be a normalized version of the second eigenvector of $P$, and note
that $\sum_{i=1}^n w_i = 0$. Without loss of generality, we assume that there is an index $i$ for which $w_i = -1$
(otherwise we can flip signs in what follows); moreover, by re-indexing as needed, we may assume
that $w_1 = -1$. We set $\mathcal{X} = [-1, 1] \subset \mathbb{R}$, and define the univariate functions $f_i(x) := (c + w_i)x$, so
that the global problem is to minimize

$$\frac{1}{n}\sum_{i=1}^n f_i(x) = \frac{1}{n}\sum_{i=1}^n (c + w_i)x = cx$$

for some constant $c > 0$ to be chosen. Note that each $f_i$ is $(c + 1)$-Lipschitz. By construction, we see
immediately that $x^* = -1$ is optimal for the global problem.

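Now consider the evolution of the sequence $\{z(t)\}_{t=0}^\infty \subset \mathbb{R}^n$ as generated by the update (5a). By
construction, we have $g_i(t) = c + w_i$ for all $t = 1, 2, \ldots$. Defining the vector $g = (c\mathbf{1} + w) \in \mathbb{R}^n$, we
have the evolution

$$\begin{aligned}
z(t+1) &= Pz(t) + g = P^2 z(t-1) + Pg + g = \cdots = \sum_{\tau=0}^{t-1} P^\tau g \\
&= \sum_{\tau=0}^{t-1} P^\tau (w + c\mathbf{1}) = \sum_{\tau=0}^{t-1} P^\tau w + ct\mathbf{1} = \sum_{\tau=0}^{t-1} \sigma_2(P)^\tau w + ct\mathbf{1}
\end{aligned} \tag{36}$$

since $P\mathbf{1} = \mathbf{1}$.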
In order to establish a lower bound, it suffices to show that at least one node is far from the
optimum after $t$ steps, and we focus on node 1. Since $w_1 = -1$, the evolution (36) guarantees that

$$z_1(t+1) = -\sum_{\tau=0}^{t-1} \sigma_2(P)^\tau + ct = ct - \frac{1 - \sigma_2(P)^t}{1 - \sigma_2(P)}. \tag{37}$$

Recalling that $\psi(x) = \frac{1}{2}x^2$ for this scalar setting, we have

$$x_1(t+1) = \operatorname*{argmin}_{x \in \mathcal{X}} \Big\{ z_1(t+1)\,x + \frac{1}{2\alpha(t)}x^2 \Big\} = \operatorname*{argmin}_{x \in \mathcal{X}} \Big\{ \big( x + \alpha(t) z_1(t+1) \big)^2 \Big\}.$$

Hence $x_1(t+1)$ is the projection of $-\alpha(t) z_1(t+1)$ onto $[-1, 1]$, and unless $z_1(t) > 0$ we have

$$f(x_1(t)) - f(-1) \ge c > 0.$$

If $t$ is overly small, the relation (37) will guarantee that $z_1(t) \le 0$, so that $x_1(t)$ is far from the
optimum. If we choose $c \le 1/3$, then a little calculation shows that we require $t = \Omega((1 - \sigma_2(P))^{-1})$
in order to drive $z_1(t)$ above zero.
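
The construction above can be simulated directly. The sketch below is an illustration under stated assumptions: it uses a lazy 20-node cycle as the communication matrix (so that $\lambda_2(P) = \sigma_2(P)$), takes $c = 1/3$, and works with whichever node carries the entry $-1$ of $w$ instead of re-indexing it to node 1; the function names are hypothetical. It iterates $z \leftarrow Pz + g$, compares the tracked coordinate with the closed form (37), and records the first iteration at which that coordinate becomes positive.

```python
import numpy as np


def lazy_cycle_matrix(n):
    """Lazy walk (1/2)(I + P0) on an n-cycle, with P0 putting 1/3 on self and neighbors."""
    P = np.zeros((n, n))
    for i in range(n):
        P[i, i] = 2.0 / 3.0
        P[i, (i - 1) % n] += 1.0 / 6.0
        P[i, (i + 1) % n] += 1.0 / 6.0
    return P


def hard_instance_crossing_time(n=20, c=1.0 / 3.0, t_max=1000):
    """Return (first t with z_k > 0, spectral gap, max deviation from formula (37))."""
    P = lazy_cycle_matrix(n)
    evals, evecs = np.linalg.eigh(P)      # eigenvalues in ascending order
    lam2 = float(evals[-2])               # second largest eigenvalue
    v = evecs[:, -2]                      # an eigenvector for lam2
    k = int(np.argmax(np.abs(v)))
    w = -v / v[k]                         # scaled so that w[k] = -1 and max |w_i| = 1
    g = c + w                             # constant gradients g_i = c + w_i
    z = np.zeros(n)
    max_err, t_cross = 0.0, None
    for t in range(1, t_max + 1):
        z = P @ z + g                     # after t updates, z = sum_{tau < t} P^tau g
        closed = c * t - (1.0 - lam2 ** t) / (1.0 - lam2)   # right-hand side of (37)
        max_err = max(max_err, abs(z[k] - closed))
        if t_cross is None and z[k] > 0:
            t_cross = t
    return t_cross, 1.0 - lam2, max_err
```

In this sketch the product of the crossing time and the spectral gap exceeds one, which is the $\Omega((1 - \sigma_2(P))^{-1})$ behavior described above for $c \le 1/3$.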

7 Convergence rates for stochastic communication 

In this section, we develop theory appropriate for stochastic and time-varying communication,
which we model by a sequence $\{P(t)\}_{t=0}^\infty$ of random matrices. We begin in Section 7.1 with basic
convergence results in this setting, and then prove Theorem 3. Section 7.2 contains a more detailed
treatment of the case of gossip algorithms, and Section 7.3 contains the setting of edge failures.

7.1 Basic convergence analysis 

Recall that Theorem 1 involves the sum $\frac{1}{n}\sum_{t=1}^T \sum_{i=1}^n \alpha(t)\,\|\bar{z}(t) - z_i(t)\|_*$. In Section 6, we showed
how to control this sum when communication between agents occurs on a static underlying network
structure via a doubly stochastic matrix $P$. We now relax the assumption that $P$ is fixed and instead
let $P(t)$ vary over time.

7.1.1 Markov chain mixing for stochastic communication 

We use $P(t) = [p_1(t) \cdots p_n(t)]$ to denote the doubly stochastic symmetric matrix at iteration $t$.
The update employed by the algorithm, modulo changes in $P$, is given by the usual updates (5a)
and (5b), namely,

$$z_i(t+1) = \sum_{j=1}^n p_{ij}(t)\, z_j(t) + g_i(t), \qquad x_i(t+1) = \Pi_{\mathcal{X}}^{\psi}(z_i(t+1), \alpha).$$

In this case, our analysis makes use of the modified definition $\Phi(t, s) = P(s)P(s+1) \cdots P(t)$.
However, we still have the evolution $\bar{z}(t+1) = \bar{z}(t) + \frac{1}{n}\sum_{i=1}^n g_i(t)$ from equation (16), and
moreover, (25) holds essentially unchanged:

$$\bar{z}(t) - z_i(t) = \sum_{s=1}^{t-1} \sum_{j=1}^n \big( 1/n - [\Phi(t-1, s)]_{ji} \big)\, g_j(s-1) + \frac{1}{n}\sum_{j=1}^n \big( g_j(t-1) - g_i(t-1) \big). \tag{38}$$

s=l j=l j=l 
To show convergence for the random communication model, we must control the convergence of
$\Phi(t, s)$ to the uniform distribution. We first claim that

$$\mathbb{P}\big[ \|\Phi(t, s)e_i - \mathbf{1}/n\|_2 \ge \epsilon \big] \le \epsilon^{-2}\,\lambda_2\big( \mathbb{E}[P(t)^2] \big)^{t-s+1}, \tag{39}$$

which we establish by recalling and modifying a few known results [BGPS06].

Let $\Delta_n$ denote the $n$-dimensional probability simplex and the vector $u(0) \in \Delta_n$ be arbitrary.
Consider the random sequence $\{u(t)\}_{t=0}^\infty$ generated by the recursion $u(t+1) = P(t)u(t)$. Let
$v(t) := u(t) - \mathbf{1}/n$ correspond to the portion of $u(t)$ orthogonal to the all 1s vector. Calculating
the second moment of $v(t+1)$, we have

$$\begin{aligned}
\mathbb{E}\big[ \langle v(t+1), v(t+1) \rangle \mid v(t) \big] &= \mathbb{E}\big[ v(t)^\top P(t)^\top P(t) v(t) \mid v(t) \big] = v(t)^\top \mathbb{E}[P(t)^\top P(t)]\, v(t) \\
&\le \|v(t)\|_2^2\,\lambda_2\big( \mathbb{E}[P(t)^\top P(t)] \big) = \|v(t)\|_2^2\,\lambda_2\big( \mathbb{E}[P(t)^2] \big)
\end{aligned}$$

since $\langle v(t), \mathbf{1} \rangle = 0$, $v(t)$ is orthogonal to the first eigenvector of $P(t)$, and $P(t)$ is symmetric.
Applying Chebyshev's inequality yields

$$\mathbb{P}\Big[ \frac{\|u(t) - \mathbf{1}/n\|_2}{\|v(0)\|_2} \ge \epsilon \Big] \le \frac{\mathbb{E}\|v(t)\|_2^2}{\|v(0)\|_2^2\,\epsilon^2} \le \epsilon^{-2}\,\frac{\|v(0)\|_2^2\,\lambda_2\big( \mathbb{E}[P(t)^2] \big)^t}{\|v(0)\|_2^2}.$$

Replacing $u(0)$ with $e_i$ and noting that $\|e_i - \mathbf{1}/n\|_2 \le 1$ yields the claim (39).

7.1.2 Proof of Theorem 3

Using the claim (39), we now prove the main theorem of this section, following an argument similar
to the proof of Theorem 2. We begin by choosing a (non-random) time index $\hat{t}$ such that for
$t - s \ge \hat{t}$, with exceedingly high probability, $\Phi(t, s)$ is close to the uniform matrix $\mathbf{1}\mathbf{1}^\top/n$. We
then break the summation from 1 to $T$ into two separate terms, separated by the cut-off point $\hat{t}$.
Throughout this derivation, we let $\lambda_2 = \lambda_2(\mathbb{E}[P(t)^2])$, where we have suppressed the dependence of
$\lambda_2$ on graph structure $G$ to ease notation.

Using the probabilistic bound (39), note that

$$t - s \ge \frac{3\log \epsilon^{-1}}{\log \lambda_2^{-1}} - 1 \quad\text{implies}\quad \mathbb{P}\big[ \|\Phi(t, s)e_i - \mathbf{1}/n\|_2 \ge \epsilon \big] \le \epsilon.$$

Consequently, if we make the choice

$$\hat{t} := \frac{3\log(T^2 n)}{\log \lambda_2^{-1}} = \frac{6\log T + 3\log n}{\log \lambda_2^{-1}} \le \frac{6\log T + 3\log n}{1 - \lambda_2},$$

then we are guaranteed that if $t - s \ge \hat{t} - 1$, then

$$\mathbb{P}\big[ \|\Phi(t, s)e_i - \mathbf{1}/n\|_2 \ge 1/(T^2 n) \big] \le (T^2 n)^2\,\lambda_2^{\,3\log(T^2 n)/\log \lambda_2^{-1}} = (T^2 n)^2\,(T^2 n)^{-3} = \frac{1}{T^2 n}. \tag{40}$$
Recalling the bound (26), we have

$$\begin{aligned}
\|\bar{z}(t) - z_i(t)\|_* &\le L\sum_{s=1}^{t-1} \|\Phi(t-1, s)e_i - \mathbf{1}/n\|_1 + 2L \\
&= L\sum_{s=t-\hat{t}}^{t-1} \|\Phi(t-1, s)e_i - \mathbf{1}/n\|_1 + L\sum_{s=1}^{t-1-\hat{t}} \|\Phi(t-1, s)e_i - \mathbf{1}/n\|_1 + 2L \\
&\le 2L\,\frac{3\log(T^2 n)}{1 - \lambda_2} + L\,\underbrace{\sqrt{n}\sum_{s=1}^{t-\hat{t}-1} \|\Phi(t-1, s)e_i - \mathbf{1}/n\|_2}_{\mathcal{S}} + 2L.
\end{aligned} \tag{41}$$

It remains to bound the sum $\mathcal{S}$. For any fixed pair $s' < s$, since the matrices $P(t)$ are doubly
stochastic, we have

$$\begin{aligned}
\|\Phi(t-1, s')e_i - \mathbf{1}/n\|_2 &= \|\Phi(s-1, s')\,\Phi(t-1, s)e_i - \mathbf{1}/n\|_2 \\
&\le |||\Phi(s-1, s')|||_2\,\|\Phi(t-1, s)e_i - \mathbf{1}/n\|_2 \\
&\le \|\Phi(t-1, s)e_i - \mathbf{1}/n\|_2,
\end{aligned}$$

where the final inequality uses the bound $|||\Phi(s-1, s')|||_2 \le 1$. From the bound (40), we have the
bound $\|\Phi(t-1, t-\hat{t}-1)e_i - \mathbf{1}/n\|_2 \le \frac{1}{T^2 n}$ with probability at least $1 - 1/(T^2 n)$. Since $s$ ranges
between 1 and $t - \hat{t}$ in the summation $\mathcal{S}$, we conclude that

$$\mathcal{S} \le \sqrt{n}\,T\,\frac{1}{T^2 n} = \frac{1}{T\sqrt{n}},$$

and hence, assuming that $n \ge 3$,

$$\|\bar{z}(t) - z_i(t)\|_* \le \frac{6L\log(T^2 n)}{1 - \lambda_2} + \frac{L}{T\sqrt{n}} + 2L$$

with probability at least $1 - 1/(T^2 n)$. Applying the union bound over all iterations $t = 1, \ldots, T$
and nodes $i = 1, \ldots, n$, we obtain

$$\mathbb{P}\Big[ \max_{t \le T}\max_{i \le n} \|\bar{z}(t) - z_i(t)\|_* > \frac{6L\log(T^2 n)}{1 - \lambda_2} + \frac{L}{T\sqrt{n}} + 2L \Big] \le \frac{1}{T}.$$

Recalling the master bound from Theorem 1 completes the proof.


In the remainder of this section, we give some applications of the stochastic framework outlined 
above, showing a few sampling schemes and giving bounds on their convergence rates. 

7.2 Gossip-like protocols 

Gossip algorithms are procedures for achieving consensus in a network robustly and quickly by 
randomly selecting one edge $(i, j)$ in the network for communication at each iteration [BGPS06].
Once nodes i and j are selected, their values are averaged. Gossip algorithms drastically reduce 
communication in the network, yet they still enjoy fast convergence and are robust to changes in 
topology. 
7.2.1 Partially asynchronous gossip protocols

In a partially asynchronous iterative method, agents synchronize their iterations [BT89]. This
is the model of standard gossip protocols, where computation proceeds in rounds, and in each
round communication occurs on one random edge. In our framework, this corresponds to using the
random transition matrix $P(t) = I - \frac{1}{2}(e_i - e_j)(e_i - e_j)^\top$. It is clear that $P(t)^\top P(t) = P(t)$, since
$P(t)$ is a projection matrix.
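
The single-edge gossip step can be made concrete with a small sketch. The code below builds the matrix $I - \frac{1}{2}(e_i - e_j)(e_i - e_j)^\top$ for a chosen pair of nodes and applies it to a vector of node values; the helper names (`gossip_matrix`, `matmul`, `apply`) and the four-node example are illustrative assumptions, not part of the original text.

```python
def gossip_matrix(n, i, j):
    """Return P = I - 0.5 * (e_i - e_j)(e_i - e_j)^T as a list of rows."""
    P = [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]
    P[i][i] = P[j][j] = 0.5      # diagonal entries of the selected pair
    P[i][j] = P[j][i] = 0.5      # the pair exchanges half of each value
    return P

def matmul(A, B):
    """Plain matrix product of two square lists of rows."""
    n = len(A)
    return [[sum(A[r][s] * B[s][c] for s in range(n)) for c in range(n)] for r in range(n)]

def apply(P, x):
    """Matrix-vector product P x."""
    return [sum(p * v for p, v in zip(row, x)) for row in P]

# Hypothetical example: 4 nodes, edge (1, 3) selected in this round.
P = gossip_matrix(4, 1, 3)
values = [4.0, 0.0, 8.0, 2.0]
averaged = apply(P, values)      # nodes 1 and 3 now both hold their average
```

Since $P$ is symmetric here, checking `matmul(P, P) == P` is the same as checking $P^\top P = P$; all other nodes keep their values unchanged.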

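Let $A$ be the adjacency matrix of the graph $G$ and $D$ be the diagonal matrix of its degrees, as
in Section [6.2]. At round $t$, edge $(i, j)$ (with $A_{ij} = 1$) is chosen with probability $1/\langle \mathbf{1}, A\mathbf{1}\rangle$. Thus,

$$
\begin{aligned}
\mathbb{E}P(t) &= \frac{1}{\langle \mathbf{1}, A\mathbf{1}\rangle}\sum_{(i,j) : A_{ij} = 1}\Big[I - \frac{1}{2}(e_i - e_j)(e_i - e_j)^\top\Big] = I - \frac{1}{\langle \mathbf{1}, A\mathbf{1}\rangle}(D - A) \\
&= I - \frac{1}{\langle \mathbf{1}, A\mathbf{1}\rangle} D^{1/2}\big(I - D^{-1/2} A D^{-1/2}\big) D^{1/2} = I - \frac{1}{\langle \mathbf{1}, A\mathbf{1}\rangle} D^{1/2}\mathcal{L} D^{1/2} \qquad (42)
\end{aligned}
$$

since $\sum_{(i,j) : A_{ij} = 1}(e_i - e_j)(e_i - e_j)^\top = 2(D - A)$. Using an argument identical to that for the earlier lemma,
we see that (42) implies that

$$ \lambda_2(\mathbb{E}P(t)) \le 1 - \frac{\min_i \delta_i}{\langle \mathbf{1}, A\mathbf{1}\rangle}\,\lambda_{n-1}(\mathcal{L}). $$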

Note that $\langle \mathbf{1}, A\mathbf{1}\rangle = \langle \mathbf{1}, D\mathbf{1}\rangle$, so that for approximately regular graphs, $\langle \mathbf{1}, A\mathbf{1}\rangle \approx n\delta_{\max}$, and
$\min_i \delta_i / \langle \mathbf{1}, A\mathbf{1}\rangle \approx 1/n$. Thus, at the expense of a factor of roughly $1/n$ in convergence rate, we can
reduce the number of messages sent per round from the number of edges in the graph, $\Theta(n\delta_{\max})$,
to one. In a clustered computing environment with some centralized control, it is possible to select
more than one edge per round so long as no two edges share vertices (for example, by selecting a
random maximal matching) and still have $P(t)^\top P(t) = P(t)$. For a $\delta$-regular graph, choosing a
random maximal matching achieves a spectral gap within constant factors of the spectral gap of
the underlying graph but uses only $\Theta(1/\delta)$ as much communication.
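
The identity $\mathbb{E}P(t) = I - (D - A)/\langle \mathbf{1}, A\mathbf{1}\rangle$ can be checked numerically on a small graph. The following sketch averages the single-edge gossip matrices over all ordered pairs $(i, j)$ with $A_{ij} = 1$ and compares the result with the closed form; the function names and the four-node graph are assumptions made for illustration.

```python
def expected_gossip_matrix(adj):
    """Average I - 0.5 (e_i - e_j)(e_i - e_j)^T over all ordered pairs with adj[i][j] == 1."""
    n = len(adj)
    pairs = [(i, j) for i in range(n) for j in range(n) if adj[i][j] == 1]
    EP = [[0.0] * n for _ in range(n)]
    for i, j in pairs:
        for r in range(n):
            for c in range(n):
                p = 1.0 if r == c else 0.0
                if r in (i, j) and c in (i, j):
                    p = 0.5                      # the 2x2 averaging block
                EP[r][c] += p / len(pairs)
    return EP

def closed_form(adj):
    """I - (D - A) / <1, A 1>."""
    n = len(adj)
    total = sum(sum(row) for row in adj)         # <1, A 1>, the sum of degrees
    deg = [sum(row) for row in adj]
    return [[(1.0 if r == c else 0.0) - ((deg[r] if r == c else 0.0) - adj[r][c]) / total
             for c in range(n)] for r in range(n)]

# Hypothetical 4-node graph: the path 0-1-2-3 plus the chord 1-3.
adj = [[0, 1, 0, 0], [1, 0, 1, 1], [0, 1, 0, 1], [0, 1, 1, 0]]
EP = expected_gossip_matrix(adj)
CF = closed_form(adj)
max_gap = max(abs(EP[r][c] - CF[r][c]) for r in range(4) for c in range(4))
```

The two matrices agree up to floating-point rounding, and each row of the expected matrix still sums to one.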

7.2.2 Totally asynchronous gossip protocol 

Now we relax the assumption that agents have synchronized clocks, so the iterations of the algorithm 
are no longer synchronized. Suppose that each agent has a random clock ticking at real-valued 
times, and at each clock tick, the agent randomly chooses one of its neighbors to communicate with. 
Further assume that each agent computes an iterative approximation to $g_i \in \partial f_i(x_i(t))$, and that
the approximation is always unbiased (an example of this is when $f_i$ is the sum of several functions,
and agent $i$ simply computes the subgradient of each function sequentially). We assume that no
two agents have clocks tick at the same time. This communication corresponds to a gossip protocol
with stochastic subgradients, and its convergence can be described simply by combining (42) with
Theorem [4]. This type of algorithm is well-suited to completely decentralized environments, such
as sensor networks.

7.3 Random edge inclusion and failure 

The two communication "protocols" we analyze now make selection of each edge at each iteration of
the algorithm independent. We begin with random edge inclusions and follow by giving convergence
guarantees for random edge failures. For both protocols, since computation of $\mathbb{E}P(t)^\top P(t)$ is in general
non-trivial, we work with the model of lazy random walks described in Section [6.2]. In the lazy
random walk model, the communication matrix at each round is $\frac{1}{2}I + \frac{1}{2}P(t)$, which is symmetric
PSD since $\sigma_1(P(t)) \le 1$. Further, for any symmetric PSD stochastic matrix $P$, $P^2 \preceq P$. With
that in mind, we see that $\mathbb{E}(\frac{1}{2}I + \frac{1}{2}P(t))^2 \preceq \frac{1}{2}I + \frac{1}{2}\mathbb{E}P(t)$, and applying Weyl's Theorem for the
eigenvalues of a Hermitian matrix [HJ85, Theorem 4.3.1],

$$ \lambda_2\big(\mathbb{E}(\tfrac{1}{2}I + \tfrac{1}{2}P(t))^2\big) \le \lambda_2\big(\tfrac{1}{2}I + \tfrac{1}{2}\mathbb{E}P(t)\big) = \tfrac{1}{2} + \tfrac{1}{2}\lambda_2(\mathbb{E}P(t)). \qquad (43) $$

Thus any bound on $\lambda_2(\mathbb{E}P(t))$ provides an upper bound on the convergence rate of the distributed
dual averaging algorithm with random communication, as in Theorem [3].

Consider the communication protocol in which with probability $1 - \delta_i/(\delta_{\max} + 1)$, node $i$ does not
communicate, and otherwise the node picks a random neighbor. If a node $i$ picks a neighbor $j$, then
$j$ also communicates back with $i$ to ensure double stochasticity of the transition matrix. We let $A(t)$
be the random adjacency matrix at time $t$. When there is an edge $(i, j)$ in the underlying graph, the
probability that node $i$ picks edge $(i, j)$ is $1/(\delta_{\max} + 1)$, and thus $\mathbb{E}A(t)_{ij} = \frac{2\delta_{\max} + 1}{(\delta_{\max} + 1)^2}$ (the edge
is used when either endpoint picks it, which happens with probability $1 - (1 - \frac{1}{\delta_{\max} + 1})^2$). The random
communication matrix is $P(t) = I - (\delta_{\max} + 1)^{-1}(D(t) - A(t))$. Let $A$ and $D$ be the adjacency
matrix and degree matrix of the underlying (non-stochastic) graph and $P$ be the communication matrix
defined earlier. With these definitions, $\mathbb{E}A(t) = \frac{2\delta_{\max} + 1}{(\delta_{\max} + 1)^2}A$, $\mathbb{E}D(t) = \frac{2\delta_{\max} + 1}{(\delta_{\max} + 1)^2}D$, and
$A - D = (P - I)(\delta_{\max} + 1)$. We have

$$ \mathbb{E}P(t) = I - (\delta_{\max} + 1)^{-1}(\mathbb{E}D(t) - \mathbb{E}A(t)) = \Big(\frac{\delta_{\max}}{\delta_{\max} + 1}\Big)^2 I + \frac{2\delta_{\max} + 1}{(\delta_{\max} + 1)^2}P, $$

and hence

$$ 1 - \lambda_2(\mathbb{E}P(t)) = \frac{2\delta_{\max} + 1}{(\delta_{\max} + 1)^2}\,(1 - \lambda_2(P)). $$

Using (43), we see that the spectral gap decreases (and hence the convergence rate may slow) by a
factor proportional to the maximum degree in the graph. This is not surprising, since the amount
of communication performed decreases by the same factor.
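
A short Monte Carlo sketch can illustrate the edge-inclusion probability used above. It simulates the protocol under the assumption that nodes make their choices independently, and compares the empirical frequency of each edge with $(2\delta_{\max} + 1)/(\delta_{\max} + 1)^2$. The function name, the 6-cycle test graph, the trial count, and the seed are all illustrative choices.

```python
import random

def simulate_edge_inclusion(adj, trials, seed=0):
    """Estimate E A(t): node i stays silent w.p. 1 - deg_i / (dmax + 1), otherwise it
    picks a uniformly random neighbor; edge {i, j} is active if either endpoint picks it."""
    rng = random.Random(seed)
    n = len(adj)
    nbrs = [[j for j in range(n) if adj[i][j]] for i in range(n)]
    dmax = max(len(v) for v in nbrs)
    counts = [[0] * n for _ in range(n)]
    for _ in range(trials):
        active = set()
        for i in range(n):
            if rng.random() < len(nbrs[i]) / (dmax + 1):
                j = rng.choice(nbrs[i])
                active.add((min(i, j), max(i, j)))   # undirected edge, counted once
        for i, j in active:
            counts[i][j] += 1
            counts[j][i] += 1
    return [[c / trials for c in row] for row in counts], dmax

# Hypothetical test graph: a 6-cycle, so every degree equals 2.
n = 6
adj = [[1 if (i - j) % n in (1, n - 1) else 0 for j in range(n)] for i in range(n)]
EA, dmax = simulate_edge_inclusion(adj, trials=20000)
c = (2 * dmax + 1) / (dmax + 1) ** 2      # predicted inclusion probability, 5/9 here
worst = max(abs(EA[i][j] - c * adj[i][j]) for i in range(n) for j in range(n))
```

With 20000 trials the empirical frequencies should sit within a few standard errors (roughly 0.004 each) of the prediction.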

A related model we can analyze is that of a network in which at every time step of the algorithm,
an edge fails with probability $p$ independently of the other edges. We assume we are using the model
of communication in the prequel, so $P(t) = I - (\delta_{\max} + 1)^{-1}D(t) + (\delta_{\max} + 1)^{-1}A(t)$. Let $A$, $D$,
and $P$ be as before and $\mathcal{L}$ be the Laplacian of the underlying graph; we easily have

$$ \mathbb{E}P(t) = I - \frac{1 - p}{\delta_{\max} + 1}D + \frac{1 - p}{\delta_{\max} + 1}A = I - \frac{1 - p}{\delta_{\max} + 1}D^{1/2}\mathcal{L}D^{1/2} = pI + (1 - p)P, $$

and $\lambda_2(\mathbb{E}P(t)) = p + (1 - p)\lambda_2(P)$. Applying the earlier convergence bound, we see that we lose at
most a factor of $\sqrt{1 - p}$ in the convergence rate.

8 Stochastic Gradient Optimization 

In this section, we show that the algorithm we have presented naturally generalizes to the case
in which the agents do not receive true subgradient information but only an unbiased estimate
of a subgradient of $f_i$. That is, during round $t$ agent $i$ receives a vector $\hat{g}_i(t)$ with
$\mathbb{E}\hat{g}_i(t) = g_i(t) \in \partial f_i(x_i(t))$. The proof is made significantly easier by the dual averaging algorithm, which by
virtue of the simplicity of its dual update smooths the propagation of errors from noisy estimates
of individual subgradients throughout the network. This was a difficulty in prior work, where
significant care was needed in the analysis to address passing noisy gradients through nonlinear
projections [RNV10].
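
As an illustration of an unbiased estimate, consider an agent whose local function is a sum of several hinge terms. The sketch below draws one term uniformly at random and rescales its subgradient by the number of terms, which is one simple way (assumed here, and different from the sequential scheme mentioned earlier) to obtain $\mathbb{E}\hat{g}_i(t) = g_i(t)$. All names and data are hypothetical.

```python
import random

def hinge_subgradient(a, x):
    """A subgradient of x -> max(0, 1 - <a, x>): -a where the hinge is active, else 0."""
    margin = sum(ai * xi for ai, xi in zip(a, x))
    return [-ai for ai in a] if margin < 1.0 else [0.0] * len(a)

def full_subgradient(A, x):
    """g_i: a subgradient of the local sum f_i(x) = sum_k max(0, 1 - <a_k, x>)."""
    g = [0.0] * len(x)
    for a in A:
        g = [gi + si for gi, si in zip(g, hinge_subgradient(a, x))]
    return g

def sampled_subgradient(A, x, rng):
    """Unbiased estimate: pick one term uniformly and rescale by the number of terms."""
    a = rng.choice(A)
    return [len(A) * s for s in hinge_subgradient(a, x)]

A = [[1.0, 0.0], [0.0, 2.0], [-1.0, 1.0]]   # hypothetical local data for one agent
x = [0.5, 0.25]
g = full_subgradient(A, x)
# Exact expectation of the estimator: average over the len(A) equally likely draws.
expected = [sum(len(A) * hinge_subgradient(a, x)[k] for a in A) / len(A) for k in range(2)]
estimate = sampled_subgradient(A, x, random.Random(0))
```

Averaging the estimator over its three equally likely outcomes recovers the full subgradient exactly, while any single draw is one of the rescaled per-term subgradients.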

8.1 Proof of Theorem [4] 



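We begin by using convexity and the Lipschitz continuity of the $f_i$ (see equations (18) and (19)),
thereby obtaining that the running sum $S(T) = \sum_{t=1}^T f(y(t)) - f(x^*)$ is upper bounded as

$$
\begin{aligned}
S(T) &\le \sum_{t=1}^T \frac{1}{n}\sum_{i=1}^n \langle g_i(t), x_i(t) - x^*\rangle + \sum_{t=1}^T \frac{1}{n}\sum_{i=1}^n L\,\|y(t) - x_i(t)\| \\
&= \sum_{t=1}^T \frac{1}{n}\sum_{i=1}^n \langle \hat{g}_i(t), x_i(t) - x^*\rangle + \sum_{t=1}^T \frac{1}{n}\sum_{i=1}^n L\,\|y(t) - x_i(t)\| + \sum_{t=1}^T \frac{1}{n}\sum_{i=1}^n \langle g_i(t) - \hat{g}_i(t), x_i(t) - x^*\rangle. \qquad (44)
\end{aligned}
$$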



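We bound the first two terms of (44) using the same derivation as that for Theorem [1]. In particular,
$\sum_{i=1}^n \langle \hat{g}_i(t), x_i(t) - x^*\rangle = \sum_{i=1}^n \langle \hat{g}_i(t), y(t) - x^*\rangle + \sum_{i=1}^n \langle \hat{g}_i(t), x_i(t) - y(t)\rangle$, and nothing in
Lemma [5] assumes that $\hat{g}_i(t)$ is related to $f_i(x_i(t))$. So we upper bound the first term in (44) with

$$ \frac{1}{\alpha(T)}\psi(x^*) + \frac{1}{2}\sum_{t=1}^T \alpha(t-1)\Big\|\frac{1}{n}\sum_{j=1}^n \hat{g}_j(t)\Big\|_*^2 + \sum_{t=1}^T \frac{1}{n}\sum_{i=1}^n \langle \hat{g}_i(t), x_i(t) - y(t)\rangle. \qquad (45) $$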



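Hölder's inequality implies that $\mathbb{E}[\|\hat{g}_i(t)\|_*\,\|\hat{g}_j(s)\|_*] \le L^2$ and $\mathbb{E}\|\hat{g}_i(t)\|_* \le L$ for any $i, j, s, t$. We
use the two inequalities to bound (45). We have

$$ \mathbb{E}\Big\|\frac{1}{n}\sum_{i=1}^n \hat{g}_i(t)\Big\|_*^2 \le \frac{1}{n^2}\sum_{i,j=1}^n \mathbb{E}\big[\|\hat{g}_i(t)\|_*\,\|\hat{g}_j(t)\|_*\big] \le L^2. $$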

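Further, $x_i(t) \in \mathcal{F}_{t-1}$ and $y(t) \in \mathcal{F}_{t-1}$ by assumption, since both are functions of $\hat{g}_j(s)$ for $j \in [n]$ and $s \le t - 1$, so

$$ \mathbb{E}\langle \hat{g}_i(t), x_i(t) - y(t)\rangle \le \mathbb{E}\,\|\hat{g}_i(t)\|_*\,\|x_i(t) - y(t)\| = \mathbb{E}\big(\mathbb{E}[\|\hat{g}_i(t)\|_* \mid \mathcal{F}_{t-1}]\,\|x_i(t) - y(t)\|\big) \le L\,\mathbb{E}\|x_i(t) - y(t)\|. $$

Recalling that $\|x_i(t) - y(t)\| \le \alpha(t)\,\|\bar{z}(t) - z_i(t)\|_*$, we proceed by putting expectations around the
norm terms in (26) and (28) to see that

$$ \frac{1}{\alpha(t)}\,\mathbb{E}\|y(t) - x_i(t)\| \le \mathbb{E}\|\bar{z}(t) - z_i(t)\|_* \le L\sum_{s=1}^{t-1}\big\|[\Phi(t-1, s)]_i - \mathbf{1}/n\big\|_1 + 2L. $$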
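Coupled with the above arguments, we can bound the expectation of (44) by

$$
\begin{aligned}
\mathbb{E}\Big[\sum_{t=1}^T f(y(t)) - f(x^*)\Big] \le{}& \frac{1}{\alpha(T)}\psi(x^*) + \frac{L^2}{2}\sum_{t=1}^T \alpha(t-1) + \frac{2L}{n}\sum_{t=1}^T\sum_{i=1}^n \mathbb{E}\|y(t) - x_i(t)\| \\
&+ \frac{1}{n}\sum_{t=1}^T\sum_{i=1}^n \mathbb{E}\big[\langle g_i(t) - \hat{g}_i(t), x_i(t) - x^*\rangle\big]. \qquad (46)
\end{aligned}
$$

Taking the expectation for the final term in the bound (46), we recall that $x_i(t) \in \mathcal{F}_{t-1}$, so

$$
\begin{aligned}
\mathbb{E}\big[\langle g_i(t) - \hat{g}_i(t), x_i(t) - x^*\rangle\big] &= \mathbb{E}\big[\mathbb{E}[\langle g_i(t) - \hat{g}_i(t), x_i(t) - x^*\rangle \mid \mathcal{F}_{t-1}]\big] \\
&= \mathbb{E}\big[\langle \mathbb{E}(g_i(t) - \hat{g}_i(t) \mid \mathcal{F}_{t-1}), x_i(t) - x^*\rangle\big] = 0, \qquad (47)
\end{aligned}
$$

which completes the proof of the first statement of the theorem.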

To show that the statement holds with high probability when $\mathcal{X}$ is compact and $\|\hat{g}_i(t)\|_* \le L$,
it is sufficient to establish that the sequence $\langle g_i(t) - \hat{g}_i(t), x_i(t) - x^*\rangle$ is a bounded martingale
difference sequence, and then apply Azuma's inequality [Azu67]. (Here we are exploiting the fact that under compactness
and bounded norm conditions, our previous bounds on terms in the decomposition (45) now hold
for the analogous terms in the decomposition (46) without taking expectations.)

By assumption on the compactness of $\mathcal{X}$ and the Lipschitz assumptions on $f_i$, we have

$$ \langle g_i(t) - \hat{g}_i(t), x_i(t) - x^*\rangle \le \|g_i(t) - \hat{g}_i(t)\|_*\,\|x_i(t) - x^*\| \le 2LR. $$

Recalling (47), we conclude that the last sum in the decomposition (46) is a bounded difference
martingale, and Azuma's inequality implies that

$$ \mathbb{P}\Big[\sum_{t=1}^T\sum_{i=1}^n \langle g_i(t) - \hat{g}_i(t), x_i(t) - x^*\rangle \ge \epsilon\Big] \le \exp\Big(-\frac{\epsilon^2}{16\,T n^2 L^2 R^2}\Big). $$

Dividing by $nT$ and setting the probability above equal to $\delta$, we obtain that with probability at least
$1 - \delta$,

$$ \frac{1}{nT}\sum_{t=1}^T\sum_{i=1}^n \langle g_i(t) - \hat{g}_i(t), x_i(t) - x^*\rangle \le 4LR\sqrt{\frac{\log\frac{1}{\delta}}{T}}. $$

The second statement of the theorem is now obtained by appealing to Lemma [3]. By convexity, we
have $f(\hat{x}_i(T)) \le \frac{1}{T}\sum_{t=1}^T f(x_i(t))$, thereby completing the proof.

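Proving the last statement of the theorem (the concentration result with uncorrelated noise
at each node) requires a martingale extension of Bernstein's inequality [Fre75]. Indeed, one form
of Freedman's inequality states that if $X_1, \ldots, X_T$ is a martingale difference sequence, $|X_t| \le B$
uniformly, and $V = \sum_{t=1}^T \mathrm{Var}(X_t \mid \mathcal{F}_{t-1})$, then for any $v, \epsilon > 0$,

$$ \mathbb{P}\Big[\sum_{t=1}^T X_t \ge \epsilon \ \text{ and } \ V \le v\Big] \le \exp\Big(-\frac{\epsilon^2}{2v + 2B\epsilon/3}\Big). $$

To extend the above bound to our setting, we recall that $\frac{1}{n}\sum_{i=1}^n \langle g_i(t) - \hat{g}_i(t), x_i(t) - x^*\rangle$ is a martingale
difference sequence uniformly bounded by $2LR$. Further, since the expectation is zero, we
have

$$
\begin{aligned}
\mathrm{Var}\Big(\frac{1}{n}\sum_{i=1}^n \langle g_i(t) - \hat{g}_i(t), x_i(t) - x^*\rangle \,\Big|\, \mathcal{F}_{t-1}\Big)
&= \frac{1}{n^2}\,\mathbb{E}\Big[\sum_{i,j=1}^n \langle g_i(t) - \hat{g}_i(t), x_i(t) - x^*\rangle\,\langle g_j(t) - \hat{g}_j(t), x_j(t) - x^*\rangle \,\Big|\, \mathcal{F}_{t-1}\Big] \\
&= \frac{1}{n^2}\,\mathbb{E}\Big[\sum_{i=1}^n \langle g_i(t) - \hat{g}_i(t), x_i(t) - x^*\rangle^2 \,\Big|\, \mathcal{F}_{t-1}\Big] \le \frac{4L^2R^2}{n}. \qquad (48)
\end{aligned}
$$

The decorrelation equality in (48) follows by our assumption that $\hat{g}_i(t)$ and $\hat{g}_j(t)$ are uncorrelated
given $\mathcal{F}_{t-1}$, and that $x_i(t)$, $g_i(t)$, and $x^* \in \mathcal{F}_{t-1}$. Substituting $4TL^2R^2/n$ as an upper bound for
the variance in Freedman's inequality, we have

$$ \mathbb{P}\Big[\sum_{t=1}^T \frac{1}{n}\sum_{i=1}^n \langle g_i(t) - \hat{g}_i(t), x_i(t) - x^*\rangle \ge \epsilon\Big] \le \exp\Big(-\frac{\epsilon^2}{(8/n)\,TL^2R^2 + (8/3)\,LR\,\epsilon}\Big). \qquad (49) $$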

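To find an $\epsilon$ so that the $\exp(\cdot)$ term is less than or equal to $\delta$, we solve

$$ \epsilon^2 - \frac{8}{3}LR\log\frac{1}{\delta}\,\epsilon - \frac{8}{n}\,TL^2R^2\log\frac{1}{\delta} = 0. $$

Solving the above quadratic in $\epsilon$, we have equality with zero for

$$ \epsilon = \frac{(8/3)LR\log\frac{1}{\delta} \pm \sqrt{(8/3)^2L^2R^2\log^2\frac{1}{\delta} + (32/n)\,TL^2R^2\log\frac{1}{\delta}}}{2}. $$

In particular, noting that $\sqrt{a + b} \le \sqrt{a} + \sqrt{b}$, it is sufficient that

$$ \epsilon \ge \frac{4}{3}LR\log\frac{1}{\delta} + \frac{4}{3}LR\log\frac{1}{\delta} + 2\sqrt{2}\,LR\sqrt{\frac{T}{n}\log\frac{1}{\delta}} $$

for the inequality in (49) to be satisfied. Thus with probability at least $1 - \delta$,

$$ \sum_{t=1}^T \frac{1}{n}\sum_{i=1}^n \langle g_i(t) - \hat{g}_i(t), x_i(t) - x^*\rangle \le 3LR\log\frac{1}{\delta} + 4LR\sqrt{\frac{T}{n}\log\frac{1}{\delta}}. $$

Dividing by $T$ completes the proof of the last statement of Theorem [4].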
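
The algebra in the last step can be sanity-checked numerically. The sketch below computes the positive root of the quadratic $\epsilon^2 - \frac{8}{3}LR\log\frac{1}{\delta}\,\epsilon - \frac{8}{n}TL^2R^2\log\frac{1}{\delta} = 0$ and compares it with the simplified bound $3LR\log\frac{1}{\delta} + 4LR\sqrt{(T/n)\log\frac{1}{\delta}}$. The exponential in `tail_bound` is the form implied by that quadratic and is an assumption of this sketch, as are the function names and parameter values.

```python
import math

def freedman_epsilon(L, R, T, n, delta):
    """Positive root of eps^2 - (8/3) L R log(1/delta) eps - (8/n) T L^2 R^2 log(1/delta) = 0."""
    log_term = math.log(1.0 / delta)
    b = (8.0 / 3.0) * L * R * log_term
    c = (8.0 / n) * T * L**2 * R**2 * log_term
    return (b + math.sqrt(b * b + 4.0 * c)) / 2.0

def simplified_epsilon(L, R, T, n, delta):
    """The looser closed form 3 L R log(1/delta) + 4 L R sqrt((T/n) log(1/delta))."""
    log_term = math.log(1.0 / delta)
    return 3.0 * L * R * log_term + 4.0 * L * R * math.sqrt(T / n * log_term)

def tail_bound(eps, L, R, T, n):
    """exp(-eps^2 / ((8/n) T L^2 R^2 + (8/3) L R eps)), the exponential being inverted."""
    return math.exp(-eps**2 / ((8.0 / n) * T * L**2 * R**2 + (8.0 / 3.0) * L * R * eps))

# Hypothetical parameter values, chosen only for illustration.
L, R, T, n, delta = 1.0, 5.0, 10_000, 100, 0.01
exact = freedman_epsilon(L, R, T, n, delta)
loose = simplified_epsilon(L, R, T, n, delta)
```

At the exact root the tail bound equals $\delta$, and because the exponent is increasing in $\epsilon$, the larger simplified value pushes the bound below $\delta$.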



9 Simulations 

In this section, we report experimental results on the network scaling behavior of the distributed
dual averaging algorithm as a function of the graph structure and number of processors $n$. These
results illustrate the excellent agreement of the empirical behavior with our theoretical predictions.

Figure 2. Plot of the function error versus the number of iterations for a grid graph. Each curve
corresponds to a grid with a different number of nodes ($n \in \{225, 400, 625\}$). As expected, larger
graphs require more iterations to reach a pre-specified tolerance $\epsilon > 0$, as defined by the iteration
number $T(\epsilon; n)$. The network scaling problem is to determine how $T(\epsilon; n)$ scales as a function of $n$.



For all experiments reported here, we consider distributed minimization of a sum of hinge loss
functions; it is this optimization problem that underlies the widely-used support vector machine
method for classification [CV95]. In a classification problem, we are given $n$ pairs of the form
$(b_i, y_i) \in \mathbb{R}^d \times \{-1, +1\}$, where $b_i \in \mathbb{R}^d$ corresponds to a feature vector and $y_i \in \{-1, +1\}$ is the
associated label. The goal is to use these samples to estimate a linear classifier, meaning a function
of the form $b \mapsto \mathrm{sign}\langle b, x\rangle$ based on some weight vector $x \in \mathbb{R}^d$. In methods based on support
vector machines, the weight vector is chosen by minimizing a sum of hinge loss functions associated
with each pair $(b_i, y_i)$. In particular, given the shorthand notation $[c]_+ := \max\{0, c\}$, the hinge
loss associated with a linear classifier based on $x$ is given by $f_i(x) = [1 - y_i\langle b_i, x\rangle]_+$. The global
objective is a sum of $n$ such terms, namely

$$ f(x) := \frac{1}{n}\sum_{i=1}^n \big[1 - y_i\langle b_i, x\rangle\big]_+. \qquad (50) $$



Setting $L = \max_i \|b_i\|_2$, we note that $f$ is $L$-Lipschitz and non-smooth at any point with $\langle b_i, x\rangle = y_i$.
It is common to impose some type of quadratic constraint on the minimization problem (50), and
for the simulations considered here, we set $\mathcal{X} = \{x \in \mathbb{R}^d \mid \|x\|_2 \le 5\}$. For a given graph size $n$,
we form a random instance of an SVM classification problem as follows. For each $i = 1, 2, \ldots, n$, we
first draw a random vector $b_i \in \mathbb{R}^d$ from the uniform distribution over the unit sphere. We then
generate a random Gaussian vector $w \sim N(0, I_{d \times d})$, and then let $a_i = \mathrm{sign}(\langle w, b_i\rangle)\,b_i$ (that is,
$a_i = y_i b_i$ with label $y_i = \mathrm{sign}\langle w, b_i\rangle$), randomly flipping the sign of 5% of the $a_i$. Note that these
choices yield a function $f$ that is Lipschitz with parameter $L = 1$. Although this is a specific ensemble of problems, we have observed
qualitatively similar behavior for other problem ensembles. In order to study the effect of graph
size and topology, we perform simulations with three different graph structures, namely cycles,
grids, and random 5-regular expanders [FKS89], with the number of nodes $n$ ranging from 100 to
900. In all cases, we use the optimal setting of the step size $\alpha$ specified in Theorem [2] and Corollary [1].

Figure 3. Each plot shows the number of iterations required to reach a fixed accuracy $\epsilon$ (vertical
axis) versus the network size $n$ (horizontal axis). Each panel shows the same plot for a different
graph topology: (a) single cycle; (b) two-dimensional grid; and (c) bounded degree expander. Step
sizes were chosen according to the spectral gap, and dotted lines show predictions of Corollary [1].
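
The random problem ensemble described in this section can be sketched in a few lines. The code below is a minimal illustration, not the authors' simulation code: it assumes each sign is flipped independently with probability 0.05 (rather than flipping exactly 5% of the points), and the function names, dimension, and seed are hypothetical.

```python
import math
import random

def make_svm_instance(n, d, flip_fraction=0.05, seed=0):
    """Draw b_i uniformly on the unit sphere, label with a Gaussian w, flip some signs."""
    rng = random.Random(seed)
    w = [rng.gauss(0.0, 1.0) for _ in range(d)]
    data = []
    for _ in range(n):
        g = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(v * v for v in g))
        b = [v / norm for v in g]                 # normalized Gaussian is uniform on the sphere
        sign = 1.0 if sum(wi * bi for wi, bi in zip(w, b)) >= 0 else -1.0
        if rng.random() < flip_fraction:          # label noise (assumed independent per point)
            sign = -sign
        data.append([sign * bi for bi in b])      # a_i = y_i * b_i
    return data

def hinge_objective(data, x):
    """f(x) = (1/n) sum_i max(0, 1 - <a_i, x>)."""
    return sum(max(0.0, 1.0 - sum(a * xi for a, xi in zip(ai, x))) for ai in data) / len(data)

def project_ball(x, radius=5.0):
    """Euclidean projection onto {x : ||x||_2 <= radius}."""
    norm = math.sqrt(sum(v * v for v in x))
    return list(x) if norm <= radius else [radius * v / norm for v in x]

data = make_svm_instance(n=100, d=10)
f_at_zero = hinge_objective(data, [0.0] * 10)     # every hinge term equals 1 at x = 0
```

Because every $a_i$ has unit norm, each hinge term is 1-Lipschitz, which matches the remark that $L = 1$ for this ensemble.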



Figure [2] provides plots of the function error $\max_i [f(\hat{x}_i(T)) - f(x^*)]$ versus the number of
iterations for grid graphs with a varying number of nodes $n \in \{225, 400, 625\}$. In addition to
demonstrating convergence, these plots also show how the convergence time scales as a function of
the graph size $n$. In particular, for a given class of optimization problems, define $T_G(\epsilon; n)$ to be the
number of iterations required to obtain an $\epsilon$-accurate solution for a graph $G$ with $n$ nodes. As shown
in Figure [2], for any fixed $\epsilon > 0$, the function $T_G(\epsilon; n)$ shifts to the right as $n$ is increased, and the
goal of network scaling analysis is to gain a precise understanding of this shifting.

As discussed following Corollary [1], for cycles, grids, and expanders, we have the following upper
bounds on the quantity $T_G(\epsilon; n)$:

$$ T_{\text{cycle}}(\epsilon; n) = O\Big(\frac{n^2}{\epsilon^2}\Big), \qquad T_{\text{grid}}(\epsilon; n) = O\Big(\frac{n}{\epsilon^2}\Big), \qquad \text{and} \qquad T_{\text{expander}}(\epsilon; n) = O\Big(\frac{1}{\epsilon^2}\Big). \qquad (51) $$

In Figure [3], we compare these theoretical predictions with the actual behavior of dual subgradient
averaging. Each panel shows the function $T_G(\epsilon; n)$ versus the graph size $n$ for the fixed value
$\epsilon = 0.1$; the three different panels correspond to different graph types: cycles (a), grids (b) and
expanders (c). In each panel, each point on the blue curve is the average of 20 trials, and the bars
show standard errors. For comparison, the dotted black line shows the theoretical prediction (51).
Note that the agreement between the empirical behavior and theoretical predictions is excellent in
all cases. In particular, panel (a) exhibits the quadratic scaling predicted for the cycle, panel (b)
exhibits the linear scaling expected for the grid, and panel (c) shows that expander graphs have
the desirable property of having constant network scaling.
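
The quadratic scaling for the cycle can be traced back to the spectral gap of the communication matrix. As a rough sketch, assume the matrix $P = I - (D - A)/(\delta_{\max} + 1)$ on an $n$-cycle, so that $P = (I + A)/3$ with eigenvalues $(1 + 2\cos(2\pi k/n))/3$; the function name and the chosen graph sizes below are illustrative.

```python
import math

def cycle_spectral_gap(n):
    """1 - sigma_2(P) for P = (I + A) / 3 on an n-cycle, using the closed-form spectrum."""
    eigs = [(1.0 + 2.0 * math.cos(2.0 * math.pi * k / n)) / 3.0 for k in range(n)]
    sigma2 = max(abs(e) for e in eigs[1:])       # k = 0 gives the top eigenvalue 1
    return 1.0 - sigma2

gaps = {n: cycle_spectral_gap(n) for n in (100, 200, 400, 800)}
scaled = {n: gap * n * n for n, gap in gaps.items()}   # approaches 4 * pi^2 / 3
```

Doubling $n$ shrinks the gap by a factor of about four, so an iteration count proportional to the inverse spectral gap grows quadratically in $n$ on the cycle.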



Though our focus in this paper is mostly a theoretical one, in our final set of experiments we
compare the distributed dual averaging method (DDA) that we present to the Markov incremental
gradient descent (MIGD) method [JRJ09] and the distributed projected gradient method [RNV10],
which seem to have the sharpest convergence rates currently in the literature. In Figure [4], we plot
the quantity $T_G(\epsilon; n)$ versus graph size $n$ for DDA and MIGD on grid and expander graphs. We use
the optimal stepsize $\alpha(t)$ suggested by the analyses for each method. (We do not plot results for the
distributed projected gradient method [RNV10] because the optimal choice of stepsize according
to the analysis therein results in such slow convergence that it does not fit on the plots.) Fig. [4]
makes it clear that, especially on graphs with good connectivity properties such as the expander
in Fig. 4(b), the dual averaging algorithm gives improved performance.

Figure 4. Each plot shows the number of iterations required to reach a fixed accuracy $\epsilon$ (vertical
axis) versus network size $n$ (horizontal axis) for distributed dual averaging (DDA) and Markov
incremental gradient descent (MIGD) [JRJ09]. The panels show the same plot for different graph
topologies: (a) two-dimensional grid, (b) bounded degree expander.



10 Conclusions and Discussion 

In this paper, we proposed and analyzed a distributed dual averaging algorithm for minimizing 
the sum of local convex functions over a network. It is computationally efficient, and we provided 
a sharp analysis of its convergence behavior as a function of the properties of the optimization 
functions and the underlying network topology. Our analysis demonstrates a close connection 
between convergence rates and mixing times of random walks on the underlying graph; such a 
connection is natural given the local and graph-constrained nature of our updates. In addition 
to analysis of deterministic updates, our results also include the case of stochastic communication 
protocols, for instance when communication occurs only along a random subset of the edges at each 
round. Such extensions allow for the design of protocols that provide interesting tradeoffs between 
the amount of communication and convergence rates. We also demonstrate that our algorithm is 
robust to noise by providing an analysis for the case of stochastic optimization with noisy gradients. 
We confirmed the sharpness of our theoretical predictions by implementation and simulation of our 
algorithm. 

There are several interesting open questions that remain to be explored. For instance, it would 
be interesting to analyze the convergence properties of other kinds of network-based optimization 
problems, by combining local information in different structures. It would also be of interest to 
study what other optimization procedures from the standard setting can be converted into efficient 
distributed algorithms to better exploit problem structure when possible.

A The Dual Averaging Algorithm

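In this section, we give a simple convergence proof for the basic (non-distributed) dual averaging
algorithm (3). In particular, we recall the updates

$$ z(t+1) = z(t) + g(t) \qquad \text{and} \qquad x(t+1) = \operatorname*{argmin}_{x \in \mathcal{X}}\Big\{\langle z(t+1), x\rangle + \frac{1}{\alpha(t)}\psi(x)\Big\}. $$

Recall our assumptions, without any loss of generality, that $0 \in \mathcal{X}$, $\psi \ge 0$ and $\psi(0) = 0$. Let
$I(x \in \mathcal{X})$ be the $\{0, \infty\}$-valued indicator function for membership in $\mathcal{X}$, and for each $\alpha > 0$, let $\psi_\alpha^*$
denote the conjugate dual of the convex function $\frac{1}{\alpha}\psi(x) + I(x \in \mathcal{X})$. By definition, the conjugate
takes the form

$$ \psi_\alpha^*(z) = \sup_{x \in \mathcal{X}}\Big\{\langle z, x\rangle - \frac{1}{\alpha}\psi(x)\Big\} = \sup_{x}\Big\{\langle z, x\rangle - \frac{1}{\alpha}\psi(x) - I(x \in \mathcal{X})\Big\}. \qquad (52) $$

The definition (4) of the projection $\Pi_{\mathcal{X}}^\psi$ shows that the supremum (52) is uniquely attained by
$\Pi_{\mathcal{X}}^\psi(-z, \alpha)$. Moreover, for any fixed $z$, we note that at $x = 0$, $\langle z, x\rangle - \frac{1}{\alpha}\psi(x) - I(x \in \mathcal{X}) = 0$. Thus,
we can restrict the supremum in (52) to the set

$$ \Big\{x \;\Big|\; \frac{1}{\alpha}\psi(x) + I(x \in \mathcal{X}) - \langle z, x\rangle \le 0\Big\} = \mathcal{X} \cap \Big\{x \;\Big|\; \frac{1}{\alpha}\psi(x) - \langle z, x\rangle \le 0\Big\}, $$

which is compact since $\mathcal{X}$ is closed and $\psi$ is strongly convex. Thus, since the supremum is uniquely
attained and $\langle z, x\rangle$ is differentiable in $z$, $\nabla\psi_\alpha^*(-z) = \Pi_{\mathcal{X}}^\psi(z, \alpha)$ [HUL96a, Theorem 4.4.2].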

This fact has two important consequences. First, since the projection is Lipschitz-continuous
(see Lemma [5]), we have the bound

$$ \|\nabla\psi_\alpha^*(-z) - \nabla\psi_\alpha^*(-z - g)\| = \|\Pi_{\mathcal{X}}^\psi(z, \alpha) - \Pi_{\mathcal{X}}^\psi(z + g, \alpha)\| \le \alpha\,\|g\|_*. $$

Consequently, an integration argument (e.g., [Nes04, Lemma 1.2.3]) yields the upper bound

$$ \psi_\alpha^*(-z - g) \le \psi_\alpha^*(-z) - \langle g, \nabla\psi_\alpha^*(-z)\rangle + \frac{1}{2}\alpha\,\|g\|_*^2. \qquad (53) $$

The second consequence is that we have

$$ x(t) = \nabla\psi_{\alpha(t-1)}^*(-z(t)) = \Pi_{\mathcal{X}}^\psi(z(t), \alpha(t-1)). $$

A.1 Proof of Lemma [2]

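To bound the sequence of inner products, we note that for any $x^* \in \mathcal{X}$, we have

$$
\begin{aligned}
-\sum_{t=1}^T \langle g(t), x^*\rangle &\le \sup_{x \in \mathcal{X}}\Big\{-\sum_{t=1}^T \langle g(t), x\rangle - \frac{1}{\alpha(T)}\psi(x)\Big\} + \frac{1}{\alpha(T)}\psi(x^*) \\
&= \psi_{\alpha(T)}^*(-z(T+1)) + \frac{1}{\alpha(T)}\psi(x^*). \qquad (54)
\end{aligned}
$$

By definition of the conjugate function $\psi_\alpha^*$, whenever we have $\alpha(t) \le \alpha(t-1)$, then we are
guaranteed that $\psi_{\alpha(t)}^*(z) \le \psi_{\alpha(t-1)}^*(z)$ for any $z \in \mathbb{R}^d$. Thus, using the upper bound (53) and the
relations $x(t) = \nabla\psi_{\alpha(t-1)}^*(-z(t))$ and $z(t+1) = z(t) + g(t)$, we obtain

$$
\begin{aligned}
\psi_{\alpha(t)}^*(-z(t+1)) &\le \psi_{\alpha(t-1)}^*(-z(t+1)) \\
&\le \psi_{\alpha(t-1)}^*(-z(t)) - \langle g(t), x(t)\rangle + \frac{1}{2}\alpha(t-1)\,\|g(t)\|_*^2.
\end{aligned}
$$

Rearranging terms yields

$$ \langle g(t), x(t)\rangle \le \psi_{\alpha(t-1)}^*(-z(t)) - \psi_{\alpha(t)}^*(-z(t+1)) + \frac{1}{2}\alpha(t-1)\,\|g(t)\|_*^2. \qquad (55) $$

Finally, we combine the upper bound on $\langle g(t), x(t)\rangle$ from equation (55) with the earlier bound (54),
thereby obtaining that for any $x^* \in \mathcal{X}$, the sum $S(T) = \sum_{t=1}^T \langle g(t), x(t) - x^*\rangle$ is upper bounded as

$$
\begin{aligned}
S(T) &\le \sum_{t=1}^T \langle g(t), x(t)\rangle + \psi_{\alpha(T)}^*(-z(T+1)) + \frac{1}{\alpha(T)}\psi(x^*) \\
&\le \frac{1}{2}\sum_{t=1}^T \alpha(t-1)\,\|g(t)\|_*^2 + \sum_{t=1}^T \big[\psi_{\alpha(t-1)}^*(-z(t)) - \psi_{\alpha(t)}^*(-z(t+1))\big] + \psi_{\alpha(T)}^*(-z(T+1)) + \frac{1}{\alpha(T)}\psi(x^*) \\
&= \frac{1}{2}\sum_{t=1}^T \alpha(t-1)\,\|g(t)\|_*^2 + \frac{1}{\alpha(T)}\psi(x^*).
\end{aligned}
$$

The last line exploited the facts that $z(1) = 0$ and $\psi_\alpha^*(0) = 0$. This completes the proof of the
claim.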
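
The bound just proved can be observed on a toy problem. The sketch below runs the basic dual averaging updates with the assumed choices $\psi(x) = \frac{1}{2}\|x\|_2^2$, $\mathcal{X}$ a Euclidean ball, $\alpha(t) = 1/\sqrt{t+1}$, and the objective $f(x) = \|x - x^*\|_1$; none of these specifics come from the text, and the helper names are hypothetical. It accumulates $\sum_t \langle g(t), x(t) - x^*\rangle$ and the right-hand side $\frac{1}{2}\sum_t \alpha(t-1)\|g(t)\|_2^2 + \psi(x^*)/\alpha(T)$.

```python
import math

def project(z, alpha, radius):
    """argmin over ||x|| <= radius of <z, x> + ||x||^2 / (2 alpha), i.e. psi = ||x||^2 / 2."""
    x = [-alpha * v for v in z]
    norm = math.sqrt(sum(v * v for v in x))
    return x if norm <= radius else [radius * v / norm for v in x]

def subgradient(x, target):
    """A subgradient of the l1 distance f(x) = sum_k |x_k - target_k|."""
    return [(1.0 if xk > tk else -1.0 if xk < tk else 0.0) for xk, tk in zip(x, target)]

def dual_averaging(target, T, radius=5.0):
    alpha = lambda t: 1.0 / math.sqrt(t + 1.0)      # non-increasing, alpha(0) = 1
    z = [0.0] * len(target)                         # z(1) = 0
    regret, grad_term = 0.0, 0.0
    for t in range(1, T + 1):
        x = project(z, alpha(t - 1), radius)        # x(t) uses z(t) and alpha(t - 1)
        g = subgradient(x, target)
        regret += sum(gk * (xk - tk) for gk, xk, tk in zip(g, x, target))
        grad_term += 0.5 * alpha(t - 1) * sum(gk * gk for gk in g)
        z = [zk + gk for zk, gk in zip(z, g)]       # z(t + 1) = z(t) + g(t)
    psi_star = 0.5 * sum(tk * tk for tk in target)
    return regret, grad_term + psi_star / alpha(T), x

target = [1.0, -2.0]      # hypothetical minimizer x*, inside the ball of radius 5
regret, bound, x_last = dual_averaging(target, T=500)
```

The accumulated inner products stay below the bound, and both quantities grow on the order of $\sqrt{T}$ for this step-size choice.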

A. 2 Proof of Lemma [3] 

Via the $L$-Lipschitz continuity of the $f_i$, we can write

$$
\begin{aligned}
\sum_{t=1}^T f(x_i(t)) - f(x^*) &= \sum_{t=1}^T f(y(t)) - f(x^*) + \sum_{t=1}^T f(x_i(t)) - f(y(t)) \\
&\le \sum_{t=1}^T f(y(t)) - f(x^*) + \sum_{t=1}^T L\,\|x_i(t) - y(t)\|.
\end{aligned}
$$

For the second bound, we again use the $L$-Lipschitz continuity of the $f_i$ and the triangle inequality,

$$
\begin{aligned}
f(\hat{x}_i(T)) - f(x^*) &= f(\hat{y}(T)) - f(x^*) + f(\hat{x}_i(T)) - f(\hat{y}(T)) \\
&\le f(\hat{y}(T)) - f(x^*) + L\,\|\hat{x}_i(T) - \hat{y}(T)\| \le f(\hat{y}(T)) - f(x^*) + \frac{L}{T}\sum_{t=1}^T \|x_i(t) - y(t)\|.
\end{aligned}
$$

Lipschitz-continuity of the projection (Lemma [5]) shows that $\|x_i(t) - y(t)\| \le \alpha(t)\,\|\bar{z}(t) - z_i(t)\|_*$,
which gives both the desired results.

A. 3 Lipschitz continuity of projections 

The following lemma on the Lipschitz-continuity of the projection operator is well-known, but we
state and prove it for completeness.

Lemma 5. For an arbitrary pair $u, v \in \mathbb{R}^d$, we have

$$ \|\Pi_{\mathcal{X}}^\psi(u, \alpha) - \Pi_{\mathcal{X}}^\psi(v, \alpha)\| \le \alpha\,\|u - v\|_*. $$

Proof. Lemma [5] is essentially an immediate consequence of the relationship between strong
convexity and Lipschitz continuity of the gradient for conjugate functions [HUL96b, Theorem
X.4.2.1], but we give a short proof for completeness. For an arbitrary pair $u, v \in \mathbb{R}^d$, denote
$w = \Pi_{\mathcal{X}}^\psi(u, \alpha)$ and $x = \Pi_{\mathcal{X}}^\psi(v, \alpha)$. By the first-order optimality conditions for convex minimization,
for any $y \in \mathcal{X}$, we have

$$ \Big\langle u + \frac{1}{\alpha}\nabla\psi(w),\, y - w\Big\rangle \ge 0 \qquad \text{and} \qquad \Big\langle v + \frac{1}{\alpha}\nabla\psi(x),\, y - x\Big\rangle \ge 0. $$

Setting $y = x$ and $y = w$ in these two inequalities (respectively) yields

$$ \langle \alpha u + \nabla\psi(w),\, x - w\rangle \ge 0 \qquad \text{and} \qquad \langle \alpha v + \nabla\psi(x),\, w - x\rangle \ge 0. $$

Adding the above two inequalities, we obtain the bound

$$ \langle \nabla\psi(w) - \nabla\psi(x),\, w - x\rangle \le \alpha\,\langle u - v,\, x - w\rangle \le \alpha\,\|u - v\|_*\,\|w - x\|. \qquad (56) $$

On the other hand, the strong convexity of $\psi$ implies that $\psi(w) \ge \psi(x) + \langle \nabla\psi(x), w - x\rangle + \frac{1}{2}\|x - w\|^2$,
with an analogous bound with the roles of $x$ and $w$ exchanged. Some algebra then leads to

$$ \langle \nabla\psi(w) - \nabla\psi(x),\, w - x\rangle \ge \|w - x\|^2, $$

which, when combined with (56), gives the desired result. □

B Background on stochastic matrices

In this section, we briefly review some well-known properties of stochastic matrices; we refer the
reader to Chapter 8 of Horn and Johnson [HJ85] for additional detail. For an $n \times n$ matrix $A$, we
let its singular values be $\sigma_1(A) \ge \sigma_2(A) \ge \cdots \ge \sigma_n(A)$, and for a real symmetric $A$, we define
the eigenvalues $\lambda_1(A) \ge \lambda_2(A) \ge \cdots \ge \lambda_n(A)$. Let $\mathbf{1}$ be the all ones vector. In our setting,
$P = [p_1 \cdots p_n] \in \mathbb{R}^{n \times n}$ is a doubly stochastic matrix, so that $P\mathbf{1} = \mathbf{1}$ and $\mathbf{1}^\top P = \mathbf{1}^\top$. We have
$\sigma_1(P) = 1$, $\lambda_1(P^\top P) = 1$, and $1 - \sigma_2(P)$ is the spectral gap, which is known to determine the
mixing properties of the Markov chain induced by $P$ [LPW08].

In order to establish the connection between mixing and spectral gap, define the uniform matrix
$F := \frac{1}{n}\mathbf{1}\mathbf{1}^\top$. Observe that $F$ is idempotent ($F^2 = F$), and moreover it satisfies $PF = FP = F$.
By construction, the eigenspectrum of $P - F$ is equal to that of $P$ except that the largest eigenvalue
$1$ is removed. Similarly, the eigenspectrum of $(P - F)^\top(P - F) = P^\top P - F^\top P - P^\top F + F^\top F =
P^\top P - F^\top F$ is identical to that of $P^\top P$ but with $\lambda_1(P^\top P) = 1$ removed. Given these properties, a
simple calculation yields that for any integer $t = 1, 2, \ldots$, we have $(P - F)^t = P^t - F$. Consequently,
for any $x \in \mathbb{R}^n$, we have

$$ \|P^t x - Fx\|_2 = \|(P - F)^t x\|_2 \le \sigma_1(P - F)\,\|(P - F)^{t-1}x\|_2 \le \cdots \le \sigma_2(P)^t\,\|x\|_2. $$

If we take $x = e_i$, denoting a canonical basis vector for $i = 1, \ldots, n$, then we see that
$\|P^t e_i - \mathbf{1}/n\|_2 \le \sigma_2(P)^t$. Taking $x \in \Delta_n$, the $n$-dimensional simplex, gives

$$ \|P^t x - \mathbf{1}/n\|_{TV} = \frac{1}{2}\|P^t x - \mathbf{1}/n\|_1 \le \frac{1}{2}\sqrt{n}\,\|P^t x - \mathbf{1}/n\|_2 \le \frac{1}{2}\sigma_2(P)^t\sqrt{n}, $$

which establishes the bound (23). (The $\sqrt{n}$ factor in the bound is standard in the Markov chain
literature, e.g., [DS91, Proposition 3].)
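
The mixing bound can be checked on a concrete doubly stochastic matrix. The sketch below assumes $P = (I + A)/3$ on a 12-node cycle, whose singular values are available in closed form, and compares $\|P^t e_i - \mathbf{1}/n\|_1$ with $\sqrt{n}\,\sigma_2(P)^t$ for a few values of $t$. The matrix choice and function names are illustrative, not taken from the text.

```python
import math

def cycle_matrix(n):
    """Doubly stochastic P = (I + A) / 3 on an n-cycle (an assumed example matrix)."""
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in (i - 1, i, i + 1):
            P[i][j % n] = 1.0 / 3.0
    return P

def sigma2_cycle(n):
    """Second largest singular value of cycle_matrix(n), from its closed-form spectrum."""
    return max(abs(1.0 + 2.0 * math.cos(2.0 * math.pi * k / n)) / 3.0 for k in range(1, n))

def l1_distance_to_uniform(P, i, t):
    """|| P^t e_i - 1/n ||_1, computed by t matrix-vector products."""
    n = len(P)
    x = [1.0 if k == i else 0.0 for k in range(n)]
    for _ in range(t):
        x = [sum(P[r][c] * x[c] for c in range(n)) for r in range(n)]
    return sum(abs(v - 1.0 / n) for v in x)

n = 12
P = cycle_matrix(n)
s2 = sigma2_cycle(n)
checks = [(l1_distance_to_uniform(P, 0, t), math.sqrt(n) * s2 ** t) for t in (1, 5, 20, 50)]
```

Each pair in `checks` holds the measured distance and the bound; the distance decays geometrically at a rate governed by $\sigma_2(P)$.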

C Eigenvalues of paths 

Let $G$ be a graph and $S$ be a subset of the nodes in the graph. Let $E(S, S^c)$ denote the set of edges
crossing between $S$ and $S^c$, and let the volume of $S$ be the sum of the degrees of the nodes in $S$,
that is, $\mathrm{vol}(S) = \sum_{i \in S}\delta_i$. The Cheeger constant of a graph $G$ is defined as

$$ h_G := \min_{S \subset V} \frac{\mathrm{card}(E(S, S^c))}{\min\{\mathrm{vol}(S), \mathrm{vol}(S^c)\}}. \qquad (57) $$

If $\mathcal{L}$ is the Laplacian of $G$, then $2h_G \ge \lambda_{n-1}(\mathcal{L}) \ge \frac{1}{2}h_G^2$ (e.g., see Lemma 2.1 and Theorem 2.2 in
Chung [Chu98]).
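
For very small graphs the Cheeger constant can be computed by brute force, which gives a quick check of the two-sided inequality. The sketch below enumerates all proper nonempty subsets of a simple 8-node path (the $k = 1$ case, chosen here purely for illustration) and compares the result with the second-smallest normalized-Laplacian eigenvalue of a path, $1 - \cos(\pi/(n-1))$, a standard closed form.

```python
import math
from itertools import combinations

def cheeger_constant(adj):
    """Brute-force h_G = min_S |E(S, S^c)| / min(vol(S), vol(S^c)) over proper nonempty S."""
    n = len(adj)
    deg = [sum(row) for row in adj]
    best = float("inf")
    for size in range(1, n):
        for S in combinations(range(n), size):
            inside = set(S)
            cut = sum(adj[i][j] for i in inside for j in range(n) if j not in inside)
            vol_S = sum(deg[i] for i in inside)
            best = min(best, cut / min(vol_S, sum(deg) - vol_S))
    return best

# Hypothetical example: the simple path on 8 nodes.
n = 8
adj = [[1 if abs(i - j) == 1 else 0 for j in range(n)] for i in range(n)]
h = cheeger_constant(adj)
# Second-smallest normalized-Laplacian eigenvalue of a path on n nodes.
lam = 1.0 - math.cos(math.pi / (n - 1))
```

The minimizing set is a contiguous half of the path touching one end, in line with the structure exploited in the proof of Lemma 6. The enumeration is exponential in $n$, so this is only practical for toy graphs.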

Lemma 6. Let $G$ be a $k$-connected path with $n$ nodes and $k \le \sqrt{n}$. Then its normalized graph
Laplacian $\mathcal{L}$ satisfies $\lambda_{n-1}(\mathcal{L}) = \Theta(k^2/n^2)$.

Proof. We invoke Theorem 4.13 in Chung [Chu98] to conclude that $\lambda_{n-1}(\mathcal{L}) = O(k^2/n^2)$, since
$G$ is a subgraph of the $k$-connected cycle. It thus suffices to prove that the Cheeger constant is
lower bounded as $h_G = \Omega(k/n)$.

Let $S$ be the set of nodes achieving the minimum in the definition (57). To make the rest of the
proof easier, assume that the degree of each node is $2k$. (We may do so without loss of generality,
since it only has the effect of increasing $\mathrm{vol}(S)$ and $\mathrm{vol}(S^c)$ in the Cheeger constant calculation, and
so any Cheeger constant calculated under this assumption lower bounds the true Cheeger constant.)
First, note that one of the nodes in $S$ must be against the end of the path: if not, shifting the
nodes in $S$ in one direction (taking into account that we must pick the direction in which more
nodes are brought near the end of the path) can only decrease $\mathrm{card}(E(S, S^c))$. Now we show that
all of the nodes in $S$ must be directly adjacent to one another. Suppose the nodes are not adjacent.
Since $k \le \sqrt{n}$, there must be a pair of nodes in $S$ with a distance of at least $k$. Let $i \in S^c$ be
between those two nodes, and let $S_\ell$ denote the nodes to the left of $i$ and $S_r$ the nodes to the
right. Collapsing all the nodes in $S_r$ to the rightmost end of the path and all the nodes in $S_\ell$ to the
leftmost end can only decrease $\mathrm{card}(E(S, S^c))$. If $|S| \ge k$, then at least one of the sets $S_r$ and $S_\ell$
shares $k(k-1)/4$ edges with $S^c$. Otherwise, if $|S| < k$, then $\mathrm{card}(E(S, S^c)) \ge k$ and $\mathrm{vol}(S) \le k^2$,
so $\mathrm{card}(E(S, S^c))/\mathrm{vol}(S) \ge 1/k$. Under the assumption $k^2 \le n$, we have $1/k \ge k/n$, from which
the result follows. □



D Composite Objectives 

In this section, we show how to generalize the dual averaging algorithm to incorporate composite
objectives, specifically those of the form $f + \varphi$ for known $\varphi$. Though it is possible to perform similar
derivations to those in Appendix A, for brevity we refer to recent work of Xiao [Xia10]. Nonetheless,
the algorithm is conceptually very similar to the dual averaging algorithm (updates (5a) and (5b)),
and equally simple to write. We assume that $\varphi$ is closed convex and non-negative, and $\mathcal{X}$ is
closed. We define the composite projection operator $\Pi_t^\varphi$ as

$$ \Pi_t^\varphi(z) = \operatorname*{argmin}_{x \in \mathcal{X}}\Big\{\langle z, x\rangle + t\varphi(x) + \frac{1}{\alpha(t)}\psi(x)\Big\}. \qquad (58) $$

The mapping $\Pi_t^\varphi$ is $\alpha(t)$-Lipschitz with respect to $\|\cdot\|$ and $\|\cdot\|_*$, that is,

$$ \|\Pi_t^\varphi(z_1) - \Pi_t^\varphi(z_2)\| \le \alpha(t)\,\|z_1 - z_2\|_*. \qquad (59) $$

As in Lemma [5], (59) is a consequence of the fact that the conjugate dual of a $1/\alpha(t)$-strongly
convex function has $\alpha(t)$-Lipschitz continuous gradient with respect to the associated dual norm,
and the gradient of the conjugate of $t\varphi(x) + \frac{1}{\alpha(t)}\psi(x)$ is simply $\Pi_t^\varphi(z)$ [HUL96b, Theorem X.4.2.1].
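
For one common choice the composite projection has a closed form. As a sketch, assume $\psi(x) = \frac{1}{2}\|x\|_2^2$, $\varphi(x) = \lambda\|x\|_1$, and $\mathcal{X} = \mathbb{R}^d$ (choices made here for illustration only); then the minimizer of $\langle z, x\rangle + t\lambda\|x\|_1 + \|x\|_2^2/(2\alpha)$ is a coordinate-wise soft-thresholding of $-\alpha z$. The function names and numeric values below are hypothetical.

```python
def composite_projection(z, t, alpha, lam):
    """argmin over x in R^d of <z, x> + t * lam * ||x||_1 + ||x||_2^2 / (2 * alpha)."""
    out = []
    for zk in z:
        shrunk = max(abs(zk) - t * lam, 0.0)          # soft-threshold the magnitude
        out.append(-alpha * shrunk * (1.0 if zk > 0 else -1.0))
    return out

def objective(x, z, t, alpha, lam):
    """The function minimized by composite_projection."""
    return (sum(zk * xk for zk, xk in zip(z, x))
            + t * lam * sum(abs(xk) for xk in x)
            + sum(xk * xk for xk in x) / (2.0 * alpha))

z1, z2 = [3.0, -0.5, 1.2], [2.0, 0.4, -1.0]
t, alpha, lam = 2, 0.5, 0.3
x1 = composite_projection(z1, t, alpha, lam)
x2 = composite_projection(z2, t, alpha, lam)
```

Coordinates of $z$ whose magnitude is below $t\lambda$ are mapped to zero, which is how the $\ell_1$ term induces sparsity, and the map remains $\alpha$-Lipschitz as in (59).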
The distributed algorithm based on the update (58) is essentially identical to the dual averaging
algorithm discussed in the main body of the paper. Each agent $i$ maintains the gradient vector

$$ z_i(t+1) = \sum_{j=1}^n P_{ij}(t)\,z_j(t) - \hat{g}_i(t) \qquad \text{where} \qquad \mathbb{E}\hat{g}_i(t) \in \partial f_i(x_i(t)). \qquad (60) $$

The update to $x_i(t+1)$ is then

$$ x_i(t+1) = \Pi_t^\varphi(-z_i(t+1)). \qquad (61) $$

As in (16), we have $\bar{z}(t+1) = \bar{z}(t) - \frac{1}{n}\sum_{j=1}^n \hat{g}_j(t)$. The following proposition, a simplification
of [Xia10, Section B.2], allows us to give a convergence guarantee for the algorithm described by
(60) and (61).

Proposition 2. Let $\alpha(t)$ be a decreasing sequence and $g(t) \in \mathbb{R}^d$ be an arbitrary sequence of vectors.
If $x(t+1) = \Pi_t^\varphi\big(\sum_{s=1}^t g(s)\big)$, then for any $x^* \in \mathcal{X}$,

$$ \sum_{t=1}^T \langle g(t), x(t) - x^*\rangle + \varphi(x(t)) - \varphi(x^*) \le \frac{1}{\alpha(T)}\psi(x^*) + \frac{1}{2}\sum_{t=1}^T \alpha(t-1)\,\|g(t)\|_*^2. $$

The above proposition, combined with the techniques used to derive Theorem [1], allows us to
easily prove convergence of distributed composite-objective dual averaging. As earlier, let
$y(t) = \Pi_{t-1}^\varphi(-\bar{z}(t))$, and assume that the $f_i$ are $L$-Lipschitz with respect to $\|\cdot\|$. Then as in (18), (19), and
(20), for any $x^* \in \mathcal{X}$, we immediately have

$$
\begin{aligned}
&\sum_{t=1}^T \big[f(y(t)) + \varphi(y(t)) - f(x^*) - \varphi(x^*)\big] \\
&\qquad \le \sum_{t=1}^T \Big\langle \frac{1}{n}\sum_{i=1}^n g_i(t),\, y(t) - x^*\Big\rangle + \sum_{t=1}^T \big[\varphi(y(t)) - \varphi(x^*)\big] + \sum_{t=1}^T\sum_{i=1}^n \frac{2L}{n}\,\|y(t) - x_i(t)\|.
\end{aligned}
$$

By definition of $y(t)$, we see that Proposition [2] bounds the above by

$$ \frac{1}{\alpha(T)}\psi(x^*) + \frac{1}{2}\sum_{t=1}^T \alpha(t-1)L^2 + \sum_{t=1}^T\sum_{i=1}^n \frac{2L}{n}\,\|y(t) - x_i(t)\|. $$

Finally, we use the fact that the mapping $\Pi_t^\varphi$ is $\alpha(t)$-Lipschitz to see that for the distributed
composite-objective projection algorithm of (60) and (61),

$$ \sum_{t=1}^T f(y(t)) + \varphi(y(t)) - f(x^*) - \varphi(x^*) \le \frac{1}{\alpha(T)}\psi(x^*) + \frac{1}{2}\sum_{t=1}^T \alpha(t-1)L^2 + \frac{2L}{n}\sum_{t=1}^T \alpha(t)\sum_{i=1}^n \|\bar{z}(t) - z_i(t)\|_*. \qquad (62) $$

Any of the techniques in the prequel can be used to bound (62).